<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Berik Ashimov&apos;s Blog</title><description>Thoughts on infrastructure, reliability, and engineering from a Senior IT Engineer with 15+ years of experience.</description><link>https://ashimov.com/</link><language>en-us</language><copyright>Copyright 2026 Berik Ashimov</copyright><managingEditor>berik@ashimov.com (Berik Ashimov)</managingEditor><webMaster>berik@ashimov.com (Berik Ashimov)</webMaster><ttl>60</ttl><item><title>NX-OS Spine/Leaf Operations: vPC, Port-Channels, and Pre-Production Checks</title><link>https://ashimov.com/posts/cisco-nexus-spineleaf/</link><guid isPermaLink="true">https://ashimov.com/posts/cisco-nexus-spineleaf/</guid><description>Operate Nexus spine/leaf fabrics without surprises. Covers vPC operational checks, port-channel hygiene, OSPF/BGP underlay verification, and failure drills before go-live.</description><pubDate>Tue, 17 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Fabric is up. vPC formed. Port-channels bundled. Then a link fails, and traffic blackholes. Or a leaf reboots, and half the servers lose connectivity. Or ECMP doesn&apos;t balance as expected.&lt;/p&gt;
&lt;p&gt;Spine/leaf sounds simple — until failure scenarios reveal configuration gaps. The time to discover these is before production, not during an outage.&lt;/p&gt;
&lt;h2&gt;vPC as an Operational Object&lt;/h2&gt;
&lt;h3&gt;What vPC Actually Is&lt;/h3&gt;
&lt;p&gt;vPC (Virtual Port Channel) makes two Nexus switches appear as one to downstream devices:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;        [Spine 1]     [Spine 2]
           │              │
     ┌─────┴──────────────┴─────┐
     │                          │
  [Leaf 1]──vPC Peer Link──[Leaf 2]
     │                          │
     └──────────┬───────────────┘
                │
           [Server]
           (port-channel)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The server sees one port-channel to one &quot;switch.&quot; In reality, half the links go to Leaf 1, half to Leaf 2.&lt;/p&gt;
&lt;h3&gt;Critical vPC Components&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;! vPC Domain configuration
vpc domain 1
  peer-switch
  peer-keepalive destination 10.0.0.2 source 10.0.0.1
  peer-gateway
  layer3 peer-router
  auto-recovery
  delay restore 120
  delay restore interface-vlan 60
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Key elements:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Failure Impact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Peer-link&lt;/td&gt;
&lt;td&gt;Sync MAC tables, forward orphan traffic&lt;/td&gt;
&lt;td&gt;vPC suspends on secondary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Peer-keepalive&lt;/td&gt;
&lt;td&gt;Detect peer failure&lt;/td&gt;
&lt;td&gt;Split-brain if both fail&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Peer-gateway&lt;/td&gt;
&lt;td&gt;Route packets addressed to the peer&apos;s router MAC locally&lt;/td&gt;
&lt;td&gt;Traffic blackhole&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auto-recovery&lt;/td&gt;
&lt;td&gt;Re-enable vPC after split-brain&lt;/td&gt;
&lt;td&gt;Manual intervention needed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;vPC Health Checks&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Overall vPC status
show vpc

# Expected output:
# vPC domain id         : 1
# Peer status           : peer adjacency formed ok
# vPC keep-alive status : peer is alive
# Configuration consistency status : success
# Per-vlan consistency status     : success
# Type-2 consistency status       : success
# vPC role              : primary

# Peer-link status
show vpc peer-link

# vPC consistency check
show vpc consistency-parameters global
show vpc consistency-parameters interface port-channel 10
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Consistency Check Failures&lt;/h3&gt;
&lt;p&gt;vPC requires certain configs to match on both peers:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check what&apos;s inconsistent
show vpc consistency-parameters global

# Type 1 (must match or vPC won&apos;t form):
# - STP mode, VLAN state, port-type
# - vPC domain settings

# Type 2 (warning, vPC still works):
# - VLAN configurations
# - IGMP snooping settings
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Fix pattern&lt;/strong&gt;: Compare configs side by side:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On both switches
show run vpc
show run interface port-channel X

# Look for differences in:
# - allowed VLANs
# - switchport mode
# - STP settings
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Port-Channel Hygiene&lt;/h2&gt;
&lt;h3&gt;LACP Configuration&lt;/h3&gt;
&lt;p&gt;Always use LACP, never static:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;! Server-facing port-channel (vPC)
interface port-channel 10
  description Server-Cluster-01
  switchport mode trunk
  switchport trunk allowed vlan 100-110
  vpc 10

interface Ethernet1/1
  description Server-Cluster-01-Link1
  switchport mode trunk
  switchport trunk allowed vlan 100-110
  channel-group 10 mode active

! LACP must be active on both ends
! &quot;mode active&quot; = initiate LACP
! &quot;mode passive&quot; = respond only (avoid)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Allowed VLANs&lt;/h3&gt;
&lt;p&gt;Only allow VLANs that should traverse the link:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;! WRONG: Allow all VLANs
interface port-channel 10
  switchport trunk allowed vlan all

! RIGHT: Explicit VLAN list
interface port-channel 10
  switchport trunk allowed vlan 100-110,200
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Why it matters:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Broadcast domains stay contained&lt;/li&gt;
&lt;li&gt;STP topology is cleaner&lt;/li&gt;
&lt;li&gt;Troubleshooting is easier&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Native VLAN&lt;/h3&gt;
&lt;p&gt;Match native VLAN on both ends to avoid untagged traffic issues:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;! Set explicit native VLAN
interface port-channel 10
  switchport trunk native vlan 999

! Verify
show interface port-channel 10 trunk
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;MTU Configuration&lt;/h3&gt;
&lt;p&gt;Jumbo frames require consistent MTU end-to-end:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;! System MTU (affects all L2 interfaces)
system jumbomtu 9216

! Per-interface MTU (L3)
interface Ethernet1/1
  mtu 9216

! Verify
show interface port-channel 10 | include MTU

! Test end-to-end
ping 10.0.1.100 df-bit packet-size 9000
&lt;/code&gt;&lt;/pre&gt;
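&lt;p&gt;As a rough sanity check, the header math behind the jumbo ping above can be sketched in Python. Figures are illustrative; whether &lt;code&gt;packet-size&lt;/code&gt; counts the ICMP payload or the full IP packet varies by platform, so verify on yours:&lt;/p&gt;

```python
# Header math for the jumbo-frame ping test (illustrative figures only).
# Assumption: "packet-size 9000" sets the ICMP payload; IPv4 and ICMP
# headers ride on top, and the result must fit the 9216-byte interface MTU.
IP_HEADER = 20    # IPv4 header without options
ICMP_HEADER = 8
MTU = 9216

def ip_packet_size(icmp_payload):
    """Total IPv4 packet size for an ICMP echo with the given payload."""
    return icmp_payload + ICMP_HEADER + IP_HEADER

total = ip_packet_size(9000)
print(f"IP packet: {total} bytes, headroom under MTU {MTU}: {MTU - total} bytes")
# Negative headroom means the df-bit ping fails with "fragmentation needed".
```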
&lt;h3&gt;Port-Channel Verification&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Status overview
show port-channel summary

# Expected output:
# 10     Po10(SU)   Eth      LACP     Eth1/1(P)   Eth1/2(P)
# SU = Layer2, Up
# P = member is up and bundled

# Detailed status
show port-channel database interface port-channel 10

# LACP counters
show lacp counters interface port-channel 10

# Member interface status
show lacp neighbor interface port-channel 10
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Underlay Routing Sanity&lt;/h2&gt;
&lt;h3&gt;OSPF Underlay Checks&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Verify all adjacencies are FULL
show ip ospf neighbors

# Expected: All neighbors in FULL state
# FULL/DR, FULL/BDR, FULL/DROTHER

# Check for stuck adjacencies
show ip ospf neighbors | include INIT|2WAY|EXSTART

# Verify routes are learned
show ip route ospf

# Check OSPF database (LSA counts should match across the fabric)
show ip ospf database
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;BGP Underlay Checks&lt;/h3&gt;
&lt;p&gt;For eBGP spine/leaf:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# All neighbors established
show bgp ipv4 unicast summary

# Expected: State = Established, or showing prefix count
# Neighbor        V    AS    MsgRcvd  MsgSent   State/PfxRcd
# 10.0.1.1        4    65001 1234     1234      10

# Check for routes from all spines
show bgp ipv4 unicast

# Verify ECMP
show ip route 10.0.2.0/24

# Should show multiple next-hops if ECMP working
# via 10.0.1.1, Eth1/49, via 10.0.1.2, Eth1/50
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;ECMP Behavior&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check maximum ECMP paths
show running-config | include maximum-paths

! Configure if needed
router bgp 65001
  address-family ipv4 unicast
    maximum-paths 4
    maximum-paths ibgp 4

# Verify load balancing
show ip load-sharing

# Test ECMP path selection
show routing hash 10.0.1.100 10.0.2.100 ip
&lt;/code&gt;&lt;/pre&gt;
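&lt;p&gt;Conceptually, ECMP path selection is a deterministic hash over flow fields: one flow always rides one path, while many flows spread across all of them. A minimal Python sketch of the idea (the real NX-OS hash uses hardware-specific inputs and the configured load-sharing fields):&lt;/p&gt;

```python
import hashlib

# Illustrative sketch of ECMP flow hashing: hash the flow fields, then pick
# one next-hop from the equal-cost set. Per-flow, not per-packet, so packets
# of a single flow never reorder across paths.
NEXT_HOPS = ["10.0.1.1 (Eth1/49)", "10.0.1.2 (Eth1/50)"]

def ecmp_pick(src_ip, dst_ip, src_port, dst_port, proto="tcp"):
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = int(hashlib.md5(key).hexdigest(), 16)
    return NEXT_HOPS[digest % len(NEXT_HOPS)]

# The same 5-tuple always lands on the same next-hop:
a = ecmp_pick("10.0.1.100", "10.0.2.100", 49152, 443)
b = ecmp_pick("10.0.1.100", "10.0.2.100", 49152, 443)
print(a == b)  # True
```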
&lt;h3&gt;Timer Alignment&lt;/h3&gt;
&lt;p&gt;Fast convergence requires aggressive timers:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;! BGP timers
router bgp 65001
  neighbor 10.0.1.1 timers 3 9
  neighbor 10.0.1.1 bfd

! OSPF timers
interface Ethernet1/49
  ip ospf hello-interval 1
  ip ospf dead-interval 3
  ip ospf bfd

! BFD configuration
feature bfd
bfd interval 250 min_rx 250 multiplier 3
&lt;/code&gt;&lt;/pre&gt;
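&lt;p&gt;The detection budget these timers buy can be computed directly: BFD declares a neighbor down after (interval times multiplier), well ahead of the protocol hold and dead timers it protects:&lt;/p&gt;

```python
# Failure-detection times implied by the timers configured above.
def detection_ms(interval_ms, multiplier):
    return interval_ms * multiplier

bfd = detection_ms(250, 3)    # bfd interval 250 min_rx 250 multiplier 3
ospf_dead = 3 * 1000          # ip ospf dead-interval 3
bgp_hold = 9 * 1000           # neighbor timers 3 9 (holdtime 9s)
print(f"BFD: {bfd} ms, OSPF dead: {ospf_dead} ms, BGP hold: {bgp_hold} ms")
```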
&lt;h2&gt;Failure Drills&lt;/h2&gt;
&lt;h3&gt;What to Test Before Go-Live&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Failure Scenario&lt;/th&gt;
&lt;th&gt;Expected Behavior&lt;/th&gt;
&lt;th&gt;Verify&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Single uplink fails&lt;/td&gt;
&lt;td&gt;Traffic shifts to other uplinks&lt;/td&gt;
&lt;td&gt;&lt;code&gt;show port-channel summary&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;vPC member fails&lt;/td&gt;
&lt;td&gt;vPC still operational&lt;/td&gt;
&lt;td&gt;&lt;code&gt;show vpc brief&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Peer-link fails&lt;/td&gt;
&lt;td&gt;Secondary suspends vPCs&lt;/td&gt;
&lt;td&gt;&lt;code&gt;show vpc&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Leaf fails&lt;/td&gt;
&lt;td&gt;Servers failover to peer&lt;/td&gt;
&lt;td&gt;Ping from server&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Spine fails&lt;/td&gt;
&lt;td&gt;ECMP removes path&lt;/td&gt;
&lt;td&gt;&lt;code&gt;show ip route&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Drill 1: Single Uplink Failure&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# On leaf, shut one uplink
interface Ethernet1/49
  shutdown

# Verify:
# 1. Port-channel stays up (degraded)
show port-channel summary

# 2. Routing adjusts
show ip route

# 3. Traffic still flows
ping &amp;lt;destination&amp;gt;

# Restore
no shutdown
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Drill 2: vPC Member Failure&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Shut one member of server port-channel
interface Ethernet1/1
  shutdown

# Verify:
# 1. vPC stays up
show vpc brief

# 2. Server still has connectivity (via peer)
# Test from server

# 3. Traffic flows through peer-link if needed
show interface port-channel &amp;lt;peer-link&amp;gt; counters

# Restore
no shutdown
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Drill 3: Peer-Link Failure&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Caution&lt;/strong&gt;: This is disruptive. Schedule maintenance window.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Simulate peer-link failure
interface port-channel 1  # peer-link
  shutdown

# Expected on secondary:
# - vPCs suspend
# - Peer-keepalive maintains split-brain prevention

show vpc
# Role should show: secondary, operational secondary

# Restore immediately
no shutdown
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Drill 4: Leaf Failure&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Simulate complete leaf failure (reload)
reload

# On peer leaf, verify:
show vpc orphan-ports
show vpc

# Servers should failover to surviving leaf
# vPC ports on surviving leaf stay up
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Drill 5: Spine Failure&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# On spine, shut all downlinks
interface Ethernet1/1-48
  shutdown

# On leaves, verify:
# 1. OSPF/BGP removes routes via failed spine
show ip route

# 2. ECMP still works via remaining spine(s)
show ip route &amp;lt;destination&amp;gt;

# 3. Traffic flows
ping &amp;lt;destination&amp;gt; source &amp;lt;loopback&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Pre-Production Checklist&lt;/h2&gt;
&lt;h3&gt;vPC Checklist&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;[ ] Peer-link is port-channel (not single link)
[ ] Peer-keepalive uses dedicated link/VRF
[ ] Consistency checks pass (show vpc consistency-parameters global)
[ ] Auto-recovery is configured
[ ] Delay restore timers appropriate for environment
[ ] peer-gateway enabled
[ ] layer3 peer-router enabled (if routing on vPC VLANs)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Port-Channel Checklist&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;[ ] LACP mode active (not passive or on)
[ ] Allowed VLANs explicitly configured
[ ] Native VLAN matches both ends
[ ] MTU consistent end-to-end
[ ] Spanning-tree port type configured (edge for servers)
[ ] BPDU guard enabled on edge ports
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Underlay Routing Checklist&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;[ ] All adjacencies FULL/Established
[ ] Routes learned from all spines
[ ] ECMP working (multiple next-hops)
[ ] BFD enabled for fast failure detection
[ ] Timers aligned (hello/dead intervals)
[ ] Loopback addresses reachable from all leaves
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Failure Testing Checklist&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;[ ] Single uplink failure tested
[ ] vPC member failure tested
[ ] Peer-link failure tested (with maintenance window)
[ ] Leaf failure simulated
[ ] Spine failure simulated
[ ] Convergence time documented
[ ] Alerts verified during failures
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Monitoring Commands Summary&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;# vPC status
show vpc
show vpc brief
show vpc peer-link
show vpc consistency-parameters global
show vpc orphan-ports

# Port-channel status
show port-channel summary
show port-channel database
show lacp counters
show lacp neighbor

# Routing status
show ip ospf neighbors
show bgp ipv4 unicast summary
show ip route
show ip route summary

# Interface status
show interface status
show interface trunk
show interface counters errors

# Spanning tree
show spanning-tree summary
show spanning-tree vlan &amp;lt;id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;Spine/leaf operations require:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;vPC hygiene&lt;/strong&gt; — consistency checks, proper peer-link/keepalive, recovery settings&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port-channel discipline&lt;/strong&gt; — LACP active, explicit VLANs, matching MTU&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Underlay verification&lt;/strong&gt; — all adjacencies up, ECMP working, fast timers&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Failure drills&lt;/strong&gt; — test every failure scenario before go-live&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The fabric that &quot;just works&quot; in the lab will surprise you in production when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A link fails and traffic asymmetry begins&lt;/li&gt;
&lt;li&gt;A leaf reboots and half the vPCs suspend&lt;/li&gt;
&lt;li&gt;ECMP doesn&apos;t balance because of misconfigured maximum-paths&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Test failures before they test you. Document expected behavior. Verify convergence times. A spine/leaf fabric is only as reliable as your preparation.&lt;/p&gt;
</content:encoded><category>cisco</category><category>networking</category><category>ha</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Cisco IOS-XE Edge Baseline: AAA, SSH, ACL, Logging, and IP SLA</title><link>https://ashimov.com/posts/cisco-iosxe-baseline/</link><guid isPermaLink="true">https://ashimov.com/posts/cisco-iosxe-baseline/</guid><description>Build a production-ready IOS-XE edge router. Covers secure management, IP SLA tracking for real failover, logging configuration, and common mistakes that break production.</description><pubDate>Fri, 13 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Internet works — sometimes. SSH is open to the world. No logs. NTP not configured, so timestamps are meaningless. When an incident happens, investigation is impossible.&lt;/p&gt;
&lt;p&gt;This is the typical state of edge routers. Nobody configures them properly from day one, and technical debt accumulates until a breach forces action.&lt;/p&gt;
&lt;p&gt;Here&apos;s the baseline every IOS-XE edge router needs.&lt;/p&gt;
&lt;h2&gt;Secure Management Plane&lt;/h2&gt;
&lt;h3&gt;AAA Configuration&lt;/h3&gt;
&lt;p&gt;Always configure AAA, even for local authentication:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;! Enable AAA
aaa new-model

! Local authentication fallback
aaa authentication login default local
aaa authentication login CONSOLE local
aaa authorization console
aaa authorization exec default local

! Create local admin user with privilege 15
username admin privilege 15 algorithm-type scrypt secret &amp;lt;strong-password&amp;gt;

! Alternative: TACACS+ with local fallback (uses the named group below)
aaa authentication login default group TACACS-SERVERS local
aaa authorization exec default group TACACS-SERVERS local
aaa accounting exec default start-stop group TACACS-SERVERS

tacacs server PRIMARY
 address ipv4 10.0.1.100
 key 7 &amp;lt;encrypted-key&amp;gt;
 timeout 3
tacacs server SECONDARY
 address ipv4 10.0.1.101
 key 7 &amp;lt;encrypted-key&amp;gt;
 timeout 3

aaa group server tacacs+ TACACS-SERVERS
 server name PRIMARY
 server name SECONDARY
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;SSH Hardening&lt;/h3&gt;
&lt;p&gt;Disable telnet. Configure SSH properly:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;! Generate RSA key (2048 minimum, 4096 preferred)
crypto key generate rsa label SSH-KEY modulus 4096

! SSH version 2 only
ip ssh version 2
ip ssh time-out 60
ip ssh authentication-retries 3

! Disable weak algorithms
ip ssh server algorithm encryption aes256-ctr aes192-ctr aes128-ctr
ip ssh server algorithm mac hmac-sha2-256 hmac-sha2-512

! VTY configuration
line vty 0 15
 login authentication default
 transport input ssh
 transport output ssh
 exec-timeout 15 0
 logging synchronous
 access-class VTY-ACCESS in

! Console
line con 0
 login authentication CONSOLE
 exec-timeout 15 0
 logging synchronous
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;VTY Access Control&lt;/h3&gt;
&lt;p&gt;Restrict who can SSH:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;! ACL for management access
ip access-list extended VTY-ACCESS
 10 permit tcp 10.0.0.0 0.255.255.255 any eq 22 log
 20 permit tcp 192.168.1.0 0.0.0.255 any eq 22 log
 30 deny ip any any log

! Apply to VTY lines
line vty 0 15
 access-class VTY-ACCESS in
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;SNMPv3&lt;/h3&gt;
&lt;p&gt;If you need SNMP, use v3 with authentication and encryption:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;! Disable SNMP v1/v2c
no snmp-server community public
no snmp-server community private

! SNMPv3 configuration
snmp-server group MONITORING v3 priv
snmp-server user monitor MONITORING v3 auth sha &amp;lt;auth-password&amp;gt; priv aes 256 &amp;lt;priv-password&amp;gt;

! Restrict SNMP source
snmp-server host 10.0.1.50 version 3 priv monitor

! ACL to restrict SNMP (apply to interface or use control-plane)
ip access-list extended SNMP-ACCESS
 permit udp host 10.0.1.50 any eq snmp
 deny udp any any eq snmp log
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;IP SLA for Real Failover&lt;/h2&gt;
&lt;h3&gt;The Problem with Link State&lt;/h3&gt;
&lt;p&gt;Interface up ≠ Internet works. Your uplink can be &quot;up&quot; while the ISP has internal issues.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;! Interface is UP
Router#show ip interface brief
GigabitEthernet0/0/0     203.0.113.2    YES NVRAM  up      up

! But Internet is unreachable
Router#ping 8.8.8.8
.....
Success rate is 0 percent (0/5)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;IP SLA Configuration&lt;/h3&gt;
&lt;p&gt;Track actual reachability, not just link state:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;! ICMP echo to reliable target
ip sla 1
 icmp-echo 8.8.8.8 source-interface GigabitEthernet0/0/0
 frequency 10
 threshold 1000
 timeout 2000
ip sla schedule 1 life forever start-time now

ip sla 2
 icmp-echo 1.1.1.1 source-interface GigabitEthernet0/0/0
 frequency 10
 threshold 1000
 timeout 2000
ip sla schedule 2 life forever start-time now

! Track SLA results
track 1 ip sla 1 reachability
track 2 ip sla 2 reachability

! Track both (require both to be up)
track 10 list boolean and
 object 1
 object 2

! Or track either (failover if both fail)
track 20 list boolean or
 object 1
 object 2
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Static Route with Tracking&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;! Primary default route (tracked)
ip route 0.0.0.0 0.0.0.0 GigabitEthernet0/0/0 203.0.113.1 10 track 10

! Backup default route (higher administrative distance, activates when primary fails)
ip route 0.0.0.0 0.0.0.0 GigabitEthernet0/0/1 198.51.100.1 20

! Note: track 10 (boolean AND) pulls the primary route if EITHER target
! fails. Track against track 20 (boolean OR) to fail over only when
! both targets are unreachable.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When track 10 goes down, primary route is removed, backup takes over.&lt;/p&gt;
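&lt;p&gt;The selection logic can be sketched as: among static candidates whose track object is up, the route with the lowest administrative distance is installed. A hypothetical Python model (names are illustrative, not IOS internals):&lt;/p&gt;

```python
# Floating-static selection sketch: a tracked route is a candidate only
# while its track object is up; the lowest administrative distance wins.
routes = [
    {"next_hop": "203.0.113.1", "ad": 10, "track_up": True},   # primary
    {"next_hop": "198.51.100.1", "ad": 20, "track_up": True},  # backup
]

def installed_route(candidates):
    usable = [r for r in candidates if r["track_up"]]
    return min(usable, key=lambda r: r["ad"]) if usable else None

print(installed_route(routes)["next_hop"])   # 203.0.113.1
routes[0]["track_up"] = False                # track object goes down
print(installed_route(routes)["next_hop"])   # 198.51.100.1
```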
&lt;h3&gt;Verify SLA Status&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Router#show ip sla statistics
IPSLAs Latest Operation Statistics

IPSLA operation id: 1
        Latest RTT: 12 milliseconds
Latest operation start time: 10:30:15 UTC Thu Mar 13 2026
Latest operation return code: OK
Number of successes: 1000
Number of failures: 2

Router#show track
Track 1
  IP SLA 1 reachability
  Reachability is Up
    1 change, last change 00:15:32
  Latest operation return code: OK
  Latest RTT (millisecs) 12

Track 10
  List boolean and
  Boolean AND is Up
    1 change, last change 00:15:32
  object 1 Up
  object 2 Up
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Logging Configuration&lt;/h2&gt;
&lt;h3&gt;Timestamps and Buffered Logging&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;! Enable timestamps on all logs
service timestamps debug datetime msec localtime show-timezone
service timestamps log datetime msec localtime show-timezone

! Buffered logging (local storage)
logging buffered 1000000 informational

! Console logging (limit to critical)
logging console critical

! Monitor logging (terminal)
logging monitor informational
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Syslog Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;! Remote syslog servers
logging host 10.0.1.50 transport udp port 514
logging host 10.0.1.51 transport tcp port 514

! Logging source interface
logging source-interface Loopback0

! Facility
logging facility local6

! Log level
logging trap informational
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Archive Logging&lt;/h3&gt;
&lt;p&gt;Capture configuration changes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;! Archive configuration
archive
 log config
  logging enable
  logging size 500
  notify syslog contenttype plaintext
  hidekeys

! View config changes
Router#show archive log config all
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;NTP Configuration&lt;/h3&gt;
&lt;p&gt;Without accurate time, logs are useless for incident response:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;! NTP servers
ntp server 10.0.1.10 prefer
ntp server 10.0.1.11

! Or public NTP (less ideal)
ntp server 0.pool.ntp.org
ntp server 1.pool.ntp.org

! NTP authentication (optional but recommended)
ntp authenticate
ntp authentication-key 1 md5 &amp;lt;key&amp;gt;
ntp trusted-key 1
ntp server 10.0.1.10 key 1

! Timezone
clock timezone UTC 0
! or
clock timezone EST -5
clock summer-time EDT recurring

! Verify
Router#show ntp status
Clock is synchronized, stratum 3, reference is 10.0.1.10
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Common Mistakes&lt;/h2&gt;
&lt;h3&gt;Mistake 1: ACL Applied Wrong Direction&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;! WRONG: ACL blocking return traffic
interface GigabitEthernet0/0/0
 ip access-group INBOUND in   ! This blocks return traffic!

! The ACL:
ip access-list extended INBOUND
 permit tcp any any eq 80
 deny ip any any

! Traffic flows: Inside → Outside (port 80)
! Return traffic: Outside → Inside (source port 80, dest port random)
! ACL blocks the return because dest port isn&apos;t 80!
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;: Use reflexive ACLs or proper stateful inspection:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;! Better approach: permit return traffic for established sessions
ip access-list extended INBOUND
 permit tcp any any established   ! return traffic (ACK or RST set)
 permit tcp any any eq 80         ! new inbound web sessions, if hosted here
 deny ip any any log
&lt;/code&gt;&lt;/pre&gt;
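&lt;p&gt;What &lt;code&gt;established&lt;/code&gt; actually matches is any TCP segment with the ACK or RST bit set, so replies pass while bare SYNs (new inbound connections) do not. It is stateless, so flags can be forged; a sketch of the check:&lt;/p&gt;

```python
# Stateless "established" match: ACK or RST set means the segment claims
# to belong to an existing session; a bare SYN is a new connection attempt.
def matches_established(tcp_flags):
    return "ACK" in tcp_flags or "RST" in tcp_flags

print(matches_established({"SYN"}))           # False: new inbound connection
print(matches_established({"SYN", "ACK"}))    # True: reply to our SYN
print(matches_established({"ACK"}))           # True: mid-session traffic
```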
&lt;h3&gt;Mistake 2: SLA Checks Wrong Target&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;! WRONG: Checking ISP&apos;s gateway only
ip sla 1
 icmp-echo 203.0.113.1   ! ISP gateway
 source-interface GigabitEthernet0/0/0

! ISP gateway is up, but their upstream is down
! Your router thinks everything is fine
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;: Check destinations beyond ISP&apos;s network:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;! Better: Check real Internet destinations
ip sla 1
 icmp-echo 8.8.8.8       ! Google DNS
ip sla 2
 icmp-echo 1.1.1.1       ! Cloudflare DNS
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Mistake 3: NAT + ACL Ordering&lt;/h3&gt;
&lt;p&gt;NAT changes addresses. ACL evaluation order matters:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;! Inbound traffic:
! 1. ACL on interface (original destination IP)
! 2. NAT translation (changes destination)
! 3. Routing (uses NAT&apos;d address)

! Outbound traffic:
! 1. Routing decision
! 2. NAT translation (changes source)
! 3. ACL on interface (NAT&apos;d source IP!)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;: Understand where your ACL is evaluated:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;! Inbound ACL - matches ORIGINAL destination
interface GigabitEthernet0/0/0
 ip nat outside
 ip access-group OUTSIDE-IN in

! ACL matches the public IP, before NAT translation
ip access-list extended OUTSIDE-IN
 permit tcp any host 203.0.113.10 eq 443  ! Public IP
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Mistake 4: Forgetting logging on deny&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;! WRONG: Silent drops
ip access-list extended BLOCK-BAD
 deny ip 10.0.0.0 0.255.255.255 any
 permit ip any any

! Can&apos;t tell what&apos;s being blocked
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;: Always log deny actions:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;! Better: Log blocked traffic
ip access-list extended BLOCK-BAD
 deny ip 10.0.0.0 0.255.255.255 any log
 permit ip any any
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Complete Edge Baseline&lt;/h2&gt;
&lt;p&gt;Putting it all together:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;! === MANAGEMENT ===
aaa new-model
aaa authentication login default local
aaa authorization exec default local

username admin privilege 15 algorithm-type scrypt secret &amp;lt;password&amp;gt;

ip ssh version 2
ip ssh time-out 60
crypto key generate rsa label SSH-KEY modulus 4096

line vty 0 15
 login authentication default
 transport input ssh
 exec-timeout 15 0
 access-class VTY-ACCESS in

ip access-list extended VTY-ACCESS
 permit tcp 10.0.0.0 0.255.255.255 any eq 22 log
 deny ip any any log

! === TIME ===
clock timezone UTC 0
ntp server 10.0.1.10 prefer
ntp server 10.0.1.11

! === LOGGING ===
service timestamps debug datetime msec localtime show-timezone
service timestamps log datetime msec localtime show-timezone
logging buffered 1000000 informational
logging host 10.0.1.50
logging source-interface Loopback0
logging trap informational

archive
 log config
  logging enable

! === IP SLA ===
ip sla 1
 icmp-echo 8.8.8.8 source-interface GigabitEthernet0/0/0
 frequency 10
ip sla schedule 1 life forever start-time now

ip sla 2
 icmp-echo 1.1.1.1 source-interface GigabitEthernet0/0/0
 frequency 10
ip sla schedule 2 life forever start-time now

track 10 list boolean and
 object 1
 object 2

! === ROUTING ===
ip route 0.0.0.0 0.0.0.0 GigabitEthernet0/0/0 203.0.113.1 10 track 10
ip route 0.0.0.0 0.0.0.0 GigabitEthernet0/0/1 198.51.100.1 20
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Verification Commands&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;! AAA status
show aaa sessions
show aaa servers

! SSH sessions
show ssh
show ip ssh

! SLA status
show ip sla statistics
show ip sla configuration
show track
show track brief

! Logging
show logging
show archive log config all

! NTP
show ntp status
show ntp associations

! ACL hits
show access-lists
show ip access-lists VTY-ACCESS
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;Every edge router needs:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Secure management&lt;/strong&gt; — AAA, SSH-only, ACL on VTY&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Real failover&lt;/strong&gt; — IP SLA tracking actual reachability, not just link state&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proper logging&lt;/strong&gt; — timestamps, buffered, syslog, NTP&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Configuration auditing&lt;/strong&gt; — archive log config&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Common mistakes that break production:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;ACL in wrong direction (blocks return traffic)&lt;/li&gt;
&lt;li&gt;SLA checking ISP gateway instead of Internet destinations&lt;/li&gt;
&lt;li&gt;NAT/ACL ordering confusion&lt;/li&gt;
&lt;li&gt;Silent denies without logging&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This baseline isn&apos;t optional — it&apos;s the minimum for production. Configure it on day one, not after an incident forces you to investigate with no logs and wrong timestamps.&lt;/p&gt;
</content:encoded><category>cisco</category><category>networking</category><category>security</category><category>ha</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Junos Routing Policy That Scales: Policy-Statement Patterns and Safe Defaults</title><link>https://ashimov.com/posts/juniper-routing-policy/</link><guid isPermaLink="true">https://ashimov.com/posts/juniper-routing-policy/</guid><description>Design maintainable Junos routing policies. Covers policy-statement structure, community naming, prefix-lists, and safe defaults that prevent routing disasters.</description><pubDate>Tue, 10 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Config grew over years. Multiple engineers added policies. Nobody documented communities. New engineer joins, asks &quot;what does community 65000:999 mean?&quot; — silence. &quot;Don&apos;t touch it — it works.&quot;&lt;/p&gt;
&lt;p&gt;This is how routing policies become unmaintainable. Junos provides powerful policy tools, but power without structure creates chaos.&lt;/p&gt;
&lt;h2&gt;Junos Policy Mental Model&lt;/h2&gt;
&lt;h3&gt;Terms Evaluate Top to Bottom&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;policy-statement EXAMPLE {
    term FIRST {
        from { ... }
        then accept;
    }
    term SECOND {
        from { ... }
        then reject;
    }
    term DEFAULT {
        then reject;  # Explicit default
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;First matching term wins. If FIRST matches, SECOND never evaluates. No terminating action in any term? The route falls through to the protocol&apos;s default policy, which for BGP import means &lt;strong&gt;accept&lt;/strong&gt; — the silent danger.&lt;/p&gt;
&lt;h3&gt;The Implicit Accept Problem&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Dangerous policy - implicit accept at end
policy-statement FILTER-ROUTES {
    term BLOCK-BOGONS {
        from {
            route-filter 10.0.0.0/8 orlonger;
        }
        then reject;
    }
    # No default term = everything else ACCEPTED
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Routes you didn&apos;t explicitly handle get accepted. Always add an explicit default term:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Safe policy - explicit default
policy-statement FILTER-ROUTES {
    term BLOCK-BOGONS {
        from {
            route-filter 10.0.0.0/8 orlonger;
        }
        then reject;
    }
    term DEFAULT-DENY {
        then reject;
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Accept vs Next Policy vs Next Term&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;then accept;       # Terminating: accept route, stop the ENTIRE policy chain
then reject;       # Terminating: reject route, stop the ENTIRE policy chain
then next policy;  # Continue to the next policy in the chain
then next term;    # Continue to the next term in THIS policy
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Multiple policies can be chained:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set protocols bgp group PEERS import [ POLICY-1 POLICY-2 POLICY-3 ]
# Evaluates POLICY-1, then POLICY-2, then POLICY-3
# First explicit accept/reject wins
&lt;/code&gt;&lt;/pre&gt;
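&lt;p&gt;The evaluation model above can be sketched in Python: terms run top to bottom, the first terminating action ends the whole chain, and an unmatched route falls through to the protocol&apos;s default policy (data structures here are illustrative, not a Junos API):&lt;/p&gt;

```python
# Policy-chain evaluation sketch. Each policy is a list of terms, each term
# a (predicate, action) pair. accept/reject terminate the whole chain;
# anything unmatched falls through to the protocol default (BGP: accept).
def evaluate(chain, route, default_action="accept"):
    for policy in chain:
        for match, action in policy:
            if match(route) and action in ("accept", "reject"):
                return action           # terminating: stop the entire chain
    return default_action               # protocol default policy

FILTER = [(lambda r: r.startswith("10."), "reject")]
DEFAULT_DENY = [(lambda r: True, "reject")]

print(evaluate([FILTER], "10.1.0.0/16"))                   # reject
print(evaluate([FILTER], "203.0.113.0/24"))                # accept (fell through!)
print(evaluate([FILTER, DEFAULT_DENY], "203.0.113.0/24"))  # reject
```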
&lt;h2&gt;Policy Building Blocks&lt;/h2&gt;
&lt;h3&gt;Prefix Lists&lt;/h3&gt;
&lt;p&gt;Named lists of prefixes. Reusable across policies.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Define prefix-list
set policy-options prefix-list INTERNAL-NETWORKS 10.0.0.0/8
set policy-options prefix-list INTERNAL-NETWORKS 172.16.0.0/12
set policy-options prefix-list INTERNAL-NETWORKS 192.168.0.0/16

# Use in policy
set policy-options policy-statement ALLOW-INTERNAL term MATCH from prefix-list INTERNAL-NETWORKS
set policy-options policy-statement ALLOW-INTERNAL term MATCH then accept
set policy-options policy-statement ALLOW-INTERNAL term DEFAULT then reject
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Route Filters&lt;/h3&gt;
&lt;p&gt;More granular than prefix-lists. Match exact prefixes or ranges.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Exact match
route-filter 10.0.0.0/24 exact;

# Match this and longer (subnets)
route-filter 10.0.0.0/16 orlonger;

# Match longer only (not the /16 itself)
route-filter 10.0.0.0/16 longer;

# Match range
route-filter 10.0.0.0/16 prefix-length-range /24-/28;

# Match up to a length
route-filter 10.0.0.0/8 upto /24;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Practical example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;policy-statement CUSTOMER-ROUTES {
    term ACCEPT-ALLOCATED {
        from {
            route-filter 203.0.113.0/24 orlonger;  # Customer&apos;s allocation
            route-filter 198.51.100.0/24 orlonger; # Second allocation
        }
        then accept;
    }
    term REJECT-REST {
        then reject;
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;AS Path Filters&lt;/h3&gt;
&lt;p&gt;Match routes by AS path patterns.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Define AS path regexes (Junos anchors them implicitly to the full path)
set policy-options as-path ORIGIN-65001 &quot;.* 65001&quot;
set policy-options as-path DIRECT-PEER &quot;65001&quot;
set policy-options as-path TRANSIT &quot;.* 65001 .*&quot;

# Use in policy
policy-statement PREFER-DIRECT {
    term DIRECT {
        from as-path DIRECT-PEER;
        then {
            local-preference 150;
            accept;
        }
    }
    term TRANSIT {
        from as-path TRANSIT;
        then {
            local-preference 100;
            accept;
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;AS path regex patterns:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Patterns are implicitly anchored — the expression must match the entire AS path (no &lt;code&gt;^&lt;/code&gt;/&lt;code&gt;$&lt;/code&gt; anchors as in IOS)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.&lt;/code&gt; — any single AS&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.*&lt;/code&gt; — zero or more ASes&lt;/li&gt;
&lt;li&gt;&lt;code&gt;[0-9]+&lt;/code&gt; — one or more digits (any AS number)&lt;/li&gt;
&lt;/ul&gt;
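&lt;p&gt;Because the pattern must match the whole path, these definitions translate naturally into anchored regexes over a space-separated AS list. The Python sketch below is an approximation for intuition, not Junos&apos;s actual matcher:&lt;/p&gt;

```python
import re

# Hand-translated equivalents of the as-path definitions above.
# Junos patterns are implicitly anchored, hence the \A...\Z here;
# each ".*" becomes "zero or more AS tokens".
PATTERNS = {
    "ORIGIN-65001": r"\A(\d+ )*65001\Z",         # ".* 65001" - path ends in 65001
    "DIRECT-PEER":  r"\A65001\Z",                # path is exactly 65001
    "TRANSIT":      r"\A(\d+ )*65001( \d+)*\Z",  # ".* 65001 .*" - 65001 anywhere
}

def matches(name, as_path):
    """as_path is a space-separated string, e.g. '65010 65001'."""
    return re.match(PATTERNS[name], as_path) is not None

print(matches("DIRECT-PEER", "65001"))         # True
print(matches("DIRECT-PEER", "65010 65001"))   # False
print(matches("ORIGIN-65001", "65010 65001"))  # True
```

&lt;p&gt;Note that with both &lt;code&gt;.*&lt;/code&gt; parts empty, TRANSIT also matches the bare path &lt;code&gt;65001&lt;/code&gt; — which is why the DIRECT term is evaluated first in the policy above.&lt;/p&gt;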
&lt;h3&gt;Communities&lt;/h3&gt;
&lt;p&gt;Tags attached to routes. The glue for policy communication.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Define communities
set policy-options community CUSTOMER-ROUTES members 65000:100
set policy-options community NO-EXPORT members no-export
set policy-options community BLACKHOLE members 65000:666

# Match community
policy-statement CUSTOMER-IMPORT {
    term TAGGED {
        from community CUSTOMER-ROUTES;
        then accept;
    }
}

# Set community
policy-statement TAG-OUTBOUND {
    term ADD-TAG {
        then {
            community add CUSTOMER-ROUTES;
            accept;
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Community Design That Scales&lt;/h2&gt;
&lt;h3&gt;Naming Convention&lt;/h3&gt;
&lt;p&gt;Without documentation, &lt;code&gt;65000:100&lt;/code&gt; means nothing. Create a system:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Pattern: ASN:TYPE+VALUE
# Types:
#   1xx = Origin (where route came from)
#   2xx = Region
#   3xx = Customer type
#   4xx = Traffic engineering
#   666 = Blackhole

# Examples:
set policy-options community ORIGIN-CUSTOMER members 65000:100
set policy-options community ORIGIN-PEER members 65000:101
set policy-options community ORIGIN-TRANSIT members 65000:102

set policy-options community REGION-US-EAST members 65000:201
set policy-options community REGION-US-WEST members 65000:202
set policy-options community REGION-EU members 65000:203

set policy-options community TYPE-ENTERPRISE members 65000:301
set policy-options community TYPE-RESIDENTIAL members 65000:302

set policy-options community TE-BACKUP-ONLY members 65000:401
set policy-options community TE-PRIMARY members 65000:402

set policy-options community BLACKHOLE members 65000:666
&lt;/code&gt;&lt;/pre&gt;
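&lt;p&gt;A scheme like this is mechanical enough to decode in tooling. A small helper — hypothetical, for log enrichment or audits, with names and ranges matching the convention above — might look like:&lt;/p&gt;

```python
# Decodes values from the ASN:TYPE+VALUE scheme above.
# The type ranges here mirror the convention; adjust to your own plan.
TYPE_RANGES = {1: "origin", 2: "region", 3: "customer-type", 4: "traffic-engineering"}

def decode_community(community):
    asn, value = community.split(":")
    value = int(value)
    if value == 666:
        return "blackhole"
    kind = TYPE_RANGES.get(value // 100, "unknown")
    return f"{kind}:{value % 100}"

print(decode_community("65000:201"))  # region:1  (REGION-US-EAST)
print(decode_community("65000:666"))  # blackhole
```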
&lt;h3&gt;Document in Config&lt;/h3&gt;
&lt;p&gt;Junos supports annotations (comments) on most configuration objects. Use them — &lt;code&gt;annotate&lt;/code&gt; operates on statements at the current hierarchy level:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set policy-options community ORIGIN-CUSTOMER members 65000:100
set policy-options community BLACKHOLE members 65000:666

# Annotate from the parent hierarchy level
edit policy-options
annotate community ORIGIN-CUSTOMER &quot;Routes learned from direct customers&quot;
annotate community BLACKHOLE &quot;Trigger RTBH - null route this prefix&quot;
top
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Community Groups&lt;/h3&gt;
&lt;p&gt;Group related communities for easier matching:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Define community group
set policy-options community ALL-ORIGINS members &quot;65000:10[0-9]&quot;
set policy-options community ALL-REGIONS members &quot;65000:2[0-9][0-9]&quot;

# Match any origin community
policy-statement CHECK-ORIGIN {
    term HAS-ORIGIN {
        from community ALL-ORIGINS;
        then accept;
    }
    term MISSING-ORIGIN {
        then reject;  # Reject routes without origin tag
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Safe Defaults&lt;/h2&gt;
&lt;h3&gt;Bogon Filtering&lt;/h3&gt;
&lt;p&gt;Always filter RFC1918, documentation prefixes, and other bogons:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Bogon prefix-list (base prefixes)
set policy-options prefix-list BOGONS 0.0.0.0/8
set policy-options prefix-list BOGONS 10.0.0.0/8
set policy-options prefix-list BOGONS 100.64.0.0/10
set policy-options prefix-list BOGONS 127.0.0.0/8
set policy-options prefix-list BOGONS 169.254.0.0/16
set policy-options prefix-list BOGONS 172.16.0.0/12
set policy-options prefix-list BOGONS 192.0.0.0/24
set policy-options prefix-list BOGONS 192.0.2.0/24
set policy-options prefix-list BOGONS 192.168.0.0/16
set policy-options prefix-list BOGONS 198.18.0.0/15
set policy-options prefix-list BOGONS 198.51.100.0/24
set policy-options prefix-list BOGONS 203.0.113.0/24
set policy-options prefix-list BOGONS 224.0.0.0/4
set policy-options prefix-list BOGONS 240.0.0.0/4

# Apply with orlonger match (catches subnets too)
policy-statement REJECT-BOGONS {
    term BOGONS {
        from {
            prefix-list-filter BOGONS orlonger;
        }
        then reject;
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Max Prefix Protection&lt;/h3&gt;
&lt;p&gt;Limit prefixes accepted from peers:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Limit with teardown
set protocols bgp group CUSTOMERS neighbor 192.0.2.1 family inet unicast prefix-limit maximum 1000
set protocols bgp group CUSTOMERS neighbor 192.0.2.1 family inet unicast prefix-limit teardown 80 idle-timeout 30

# Teardown at 80% (800 prefixes), wait 30 minutes before retry
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Reject Unless Explicitly Allowed&lt;/h3&gt;
&lt;p&gt;Default-deny at BGP group level:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Import policy for customer
policy-statement CUSTOMER-192-0-2-1-IMPORT {
    term ACCEPT-ANNOUNCED {
        from {
            prefix-list CUSTOMER-192-0-2-1-PREFIXES;
        }
        then {
            community add ORIGIN-CUSTOMER;
            accept;
        }
    }
    term REJECT-ALL {
        then reject;
    }
}

# Customer&apos;s allowed prefixes
set policy-options prefix-list CUSTOMER-192-0-2-1-PREFIXES 198.51.100.0/24
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;AS Path Sanity&lt;/h3&gt;
&lt;p&gt;Reject private ASNs and your own ASN from external peers:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Private ASN range (64512-65534)
set policy-options as-path PRIVATE-ASN &quot;.* (6451[2-9]|645[2-9][0-9]|64[6-9][0-9]{2}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-4]) .*&quot;

# Simpler (slightly looser) alternative - also matches 64500-64511
set policy-options as-path PRIVATE-ASN-SIMPLE &quot;.* (64[5-9][0-9]{2}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-4]) .*&quot;

# Own ASN in path (shouldn&apos;t happen from external)
set policy-options as-path OWN-ASN &quot;.* 65000 .*&quot;

policy-statement SANITY-CHECK {
    term REJECT-PRIVATE-ASN {
        from as-path PRIVATE-ASN;
        then reject;
    }
    term REJECT-OWN-ASN {
        from as-path OWN-ASN;
        then reject;
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note: AS path regex for full private range coverage is complex. Many operators maintain external prefix/AS-path lists (e.g., from Team Cymru or RIPE) rather than hand-crafted regex.&lt;/p&gt;
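&lt;p&gt;When you do hand-craft one, brute-force the boundaries — off-by-one bugs in these alternations are easy to ship. The alternation below is one candidate intended to cover the 16-bit private range exactly; checking every value takes milliseconds:&lt;/p&gt;

```python
import re

# Brute-force check that a private-ASN alternation covers exactly 64512-65534.
# This alternation is a candidate; swap in whatever your policy actually uses.
PRIVATE_ASN = re.compile(
    r"\A(6451[2-9]|645[2-9][0-9]|64[6-9][0-9]{2}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-4])\Z"
)

matched = [asn for asn in range(60000, 70000) if PRIVATE_ASN.match(str(asn))]
assert matched == list(range(64512, 65535)), "alternation has boundary bugs"
print("covers exactly 64512-65534")
```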
&lt;h2&gt;Policy Structure Patterns&lt;/h2&gt;
&lt;h3&gt;Layered Import Policy&lt;/h3&gt;
&lt;p&gt;Build policies in layers for maintainability:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Layer 1: Sanity checks (apply to all)
policy-statement IMPORT-SANITY {
    term REJECT-BOGONS { from { prefix-list-filter BOGONS orlonger; } then reject; }
    term REJECT-TOO-LONG { from route-filter 0.0.0.0/0 prefix-length-range /25-/32; then reject; }
    term REJECT-DEFAULT { from route-filter 0.0.0.0/0 exact; then reject; }
    term CONTINUE { then next policy; }
}

# Layer 2: Peer-specific acceptance
policy-statement IMPORT-PEER-65001 {
    term ACCEPT-PREFIXES {
        from prefix-list PEER-65001-PREFIXES;
        then {
            community add ORIGIN-PEER;
            local-preference 100;
            accept;
        }
    }
    term REJECT-REST { then reject; }
}

# Apply both
set protocols bgp group PEERS neighbor 192.0.2.1 import [ IMPORT-SANITY IMPORT-PEER-65001 ]
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Export Policy Template&lt;/h3&gt;
&lt;p&gt;Consistent export structure:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;policy-statement EXPORT-TO-PEERS {
    term EXPORT-CUSTOMERS {
        from community ORIGIN-CUSTOMER;
        then {
            community delete ALL-INTERNAL;  # Strip internal communities
            accept;
        }
    }
    term EXPORT-OWN {
        from {
            protocol direct;
            prefix-list OWN-PREFIXES;
        }
        then accept;
    }
    term REJECT-REST {
        then reject;
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Debugging Policies&lt;/h2&gt;
&lt;h3&gt;Test Policy Match&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Test which policy term matches a route
test policy POLICY-NAME 192.0.2.0/24

# Output shows:
# Route 192.0.2.0/24
#   Term: ACCEPT-CUSTOMERS
#   Action: accept
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Show Received vs Active&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# What peer is sending (before import policy)
show route receive-protocol bgp 192.0.2.1

# What we accepted (after import policy)
show route protocol bgp neighbor 192.0.2.1

# Compare to find filtered routes
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Hidden Routes&lt;/h3&gt;
&lt;p&gt;Routes filtered by policy become &quot;hidden&quot;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Show hidden routes
show route hidden

# Why is it hidden?
show route 192.0.2.0/24 hidden extensive
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Policy Decision Flow&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────┐
│                  Import Policy Chain                      │
├─────────────────────────────────────────────────────────┤
│                                                           │
│  Policy 1          Policy 2          Policy 3            │
│  ┌─────────┐      ┌─────────┐      ┌─────────┐          │
│  │ Term A  │─no──→│ Term A  │─no──→│ Term A  │          │
│  └────┬────┘      └────┬────┘      └────┬────┘          │
│       │yes             │yes             │yes             │
│       ↓                ↓                ↓                │
│  [accept/reject]  [accept/reject]  [accept/reject]      │
│                                                           │
│  If no match in any term of any policy:                  │
│  → IMPLICIT ACCEPT (danger!)                             │
│                                                           │
└─────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;Junos routing policy is powerful but requires discipline:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Always explicit default&lt;/strong&gt; — never rely on implicit accept&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Name everything meaningfully&lt;/strong&gt; — communities, prefix-lists, policies&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Document in config&lt;/strong&gt; — use &lt;code&gt;annotate&lt;/code&gt; liberally&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Layer your policies&lt;/strong&gt; — sanity checks separate from peer-specific logic&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Test before commit&lt;/strong&gt; — use &lt;code&gt;test policy&lt;/code&gt; command&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A well-structured policy config is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Readable by new engineers&lt;/li&gt;
&lt;li&gt;Modifiable without fear&lt;/li&gt;
&lt;li&gt;Auditable for compliance&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The goal isn&apos;t clever regex — it&apos;s maintainability. If you can&apos;t explain what &lt;code&gt;65000:247&lt;/code&gt; means without checking documentation, your community scheme needs work.&lt;/p&gt;
&lt;p&gt;Policy-statement is an engineering system. Treat it like code: structured, documented, tested.&lt;/p&gt;
</content:encoded><category>juniper</category><category>bgp</category><category>networking</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Junos SRX Security Policies in Real Life: Why Traffic Doesn&apos;t Match</title><link>https://ashimov.com/posts/juniper-srx-policies/</link><guid isPermaLink="true">https://ashimov.com/posts/juniper-srx-policies/</guid><description>Debug SRX policy issues when traffic flows wrong or NAT fails. Covers zone chain, policy hit counters, flow trace, and the top 5 reasons policies never match.</description><pubDate>Fri, 06 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Everything configured. Policy created. Commit complete. Traffic doesn&apos;t flow. Or flows to the wrong place. Or NAT doesn&apos;t apply. Welcome to SRX troubleshooting.&lt;/p&gt;
&lt;p&gt;SRX security processing has a specific order. Understanding that order — and knowing where to look when things break — separates frustrating hours from quick fixes.&lt;/p&gt;
&lt;h2&gt;How SRX Actually Processes Traffic&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;Packet arrives
    ↓
Interface → Security Zone lookup
    ↓
Route lookup (egress interface/zone)
    ↓
Security Policy evaluation (from-zone → to-zone)
    ↓
NAT processing (if matched)
    ↓
Forward packet
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The critical insight: &lt;strong&gt;zone is determined by interface, policy is matched by zone pair&lt;/strong&gt;. If your zones are wrong, your policy will never match.&lt;/p&gt;
&lt;h3&gt;The Zone Chain&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  Interface  │ ──→ │    Zone     │ ──→ │   Policy    │
│   ge-0/0/1  │     │   trust     │     │trust→untrust│
└─────────────┘     └─────────────┘     └─────────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Traffic from ge-0/0/1 (trust zone) going to ge-0/0/0 (untrust zone) needs a policy from trust to untrust. Simple concept, endless misconfigurations.&lt;/p&gt;
&lt;h2&gt;Top 5 Reasons &quot;Policy Not Hit&quot;&lt;/h2&gt;
&lt;h3&gt;1. Zone Mismatch&lt;/h3&gt;
&lt;p&gt;The most common issue. Traffic enters through one interface but policy expects a different zone.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check interface zone assignment
show security zones

# Output:
# Security zone: trust
#   Interfaces bound: 1
#     ge-0/0/1.0
# Security zone: untrust
#   Interfaces bound: 1
#     ge-0/0/0.0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Fix pattern&lt;/strong&gt;: Verify source and destination zones match your policy exactly.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# See which zone traffic is hitting
show security flow session source-prefix 192.168.1.0/24

# Check policy for that zone pair
show security policies from-zone trust to-zone untrust
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Address Book Scope&lt;/h3&gt;
&lt;p&gt;Address book entries are zone-scoped by default. Address defined in zone A is invisible to policies involving zone B.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Wrong: Address in wrong zone&apos;s address book
set security zones security-zone untrust address-book address internal-server 10.0.1.100/32

# Policy from trust→untrust can&apos;t see this address!
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Fix pattern&lt;/strong&gt;: Use global address book for addresses used across zones.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Correct: Global address book
set security address-book global address internal-server 10.0.1.100/32
set security address-book global address-set servers address internal-server

# Now usable in any policy
set security policies from-zone trust to-zone untrust policy allow-server match destination-address internal-server
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Application/ALG Issues&lt;/h3&gt;
&lt;p&gt;Application identification can prevent matches. ALG (Application Layer Gateway) can modify packets unexpectedly.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Policy expects specific application
set security policies from-zone trust to-zone untrust policy web-traffic match application junos-http

# But traffic is HTTPS on non-standard port - doesn&apos;t match junos-http
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Fix pattern&lt;/strong&gt;: Use application sets or &lt;code&gt;any&lt;/code&gt; for troubleshooting, then narrow down.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Debug: What application is SRX seeing?
show security flow session extensive

# Look for &quot;Application:&quot; field
# Session ID: 12345, Application: junos-https, ...

# For non-standard ports
set applications application custom-https protocol tcp destination-port 8443
set security policies from-zone trust to-zone untrust policy web-traffic match application custom-https
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4. Routing Instance Issues&lt;/h3&gt;
&lt;p&gt;Traffic in a routing instance may not hit policies as expected. VRF leaking adds complexity.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Traffic in VRF but policy in main instance
show route table CUSTOMER-VRF 10.0.0.0/24

# Check session table for routing-instance
show security flow session routing-instance CUSTOMER-VRF
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Fix pattern&lt;/strong&gt;: Ensure policy exists for the routing instance context.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Policy must reference correct zones
# Zones are global, but routing affects egress interface selection
show security zones security-zone trust interfaces
# Verify interface is in expected routing instance
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;5. NAT Order Confusion&lt;/h3&gt;
&lt;p&gt;Source NAT happens after policy match. Destination NAT happens before. This ordering causes endless confusion.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Inbound traffic:
  1. Destination NAT (change dest IP)
  2. Route lookup (with new dest)
  3. Policy match (with original source, NAT&apos;d destination)
  4. Source NAT (if configured)

Outbound traffic:
  1. Route lookup
  2. Policy match (original IPs)
  3. Source NAT
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Fix pattern&lt;/strong&gt;: For destination NAT, policy must match the &lt;strong&gt;post-NAT&lt;/strong&gt; destination.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Destination NAT: public 203.0.113.10 → private 10.0.1.100
set security nat destination pool server-pool address 10.0.1.100/32
set security nat destination rule-set inbound rule to-server match destination-address 203.0.113.10/32
set security nat destination rule-set inbound rule to-server then destination-nat pool server-pool

# Policy must match the NAT&apos;d destination (10.0.1.100), not public IP!
set security policies from-zone untrust to-zone trust policy allow-server match destination-address 10.0.1.100/32
&lt;/code&gt;&lt;/pre&gt;
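&lt;p&gt;The ordering is easy to sanity-check with a toy model — plain Python, nothing SRX-specific: apply destination NAT first, then match the policy against the rewritten address.&lt;/p&gt;

```python
# Toy model (plain Python, not SRX code) of inbound processing order:
# destination NAT rewrites first, so the policy sees the post-NAT address.
def process_inbound(pkt, dnat_map, policies):
    pkt = dict(pkt, dst=dnat_map.get(pkt["dst"], pkt["dst"]))  # 1. DNAT
    # 2. route lookup omitted
    for pol in policies:                                       # 3. policy match
        if pol["dst"] == pkt["dst"]:
            return pol["action"]
    return "deny"                                              # no match

pkt = {"src": "198.51.100.7", "dst": "203.0.113.10"}
dnat = {"203.0.113.10": "10.0.1.100"}

# A policy written against the public IP never matches:
print(process_inbound(pkt, dnat, [{"dst": "203.0.113.10", "action": "permit"}]))  # deny
# A policy written against the post-NAT IP does:
print(process_inbound(pkt, dnat, [{"dst": "10.0.1.100", "action": "permit"}]))    # permit
```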
&lt;h2&gt;Debug Workflow&lt;/h2&gt;
&lt;h3&gt;Step 1: Session Table&lt;/h3&gt;
&lt;p&gt;First, check if session exists:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;show security flow session

# Filter by IP
show security flow session source-prefix 192.168.1.100/32
show security flow session destination-prefix 10.0.1.100/32

# Extensive output shows policy that matched
show security flow session extensive
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Session exists? Check which policy matched. No session? Traffic isn&apos;t flowing or policy is denying.&lt;/p&gt;
&lt;h3&gt;Step 2: Policy Hit Counters&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show all policies with hit counts
show security policies hit-count

# Output:
# Logical system: root-logical-system
# Index   From zone        To zone          Name             Policy count
# 1       trust            untrust          allow-web        12543
# 2       trust            untrust          allow-dns        3421
# 3       trust            untrust          deny-all         89

# Zero hits on your policy? It&apos;s not matching.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Clear and retest:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Clear counters
clear security policies hit-count

# Generate test traffic, then check again
show security policies hit-count
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 3: Flow Trace&lt;/h3&gt;
&lt;p&gt;The nuclear option. Enables detailed logging of packet processing.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Enable traceoptions
set security flow traceoptions file flow-trace
set security flow traceoptions file size 10m
set security flow traceoptions flag basic-datapath
set security flow traceoptions packet-filter my-filter source-prefix 192.168.1.100/32
commit

# Generate traffic, then view trace
show log flow-trace

# CRITICAL: Disable when done (performance impact!)
delete security flow traceoptions
commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Reading Flow Trace Output&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;flow_process_pkt: received packet on ge-0/0/1.0, src 192.168.1.100, dst 8.8.8.8
  flow_first_policy_search: policy search from zone trust to zone untrust
  flow_first_policy_search: policy found: allow-web
  flow_first_routing_lookup: dst 8.8.8.8, ifl 72 (ge-0/0/0.0)
  flow_first_install_session: session installed successfully
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Key lines to watch:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;policy search from zone X to zone Y&lt;/code&gt; — confirms zone pair&lt;/li&gt;
&lt;li&gt;&lt;code&gt;policy found: &amp;lt;name&amp;gt;&lt;/code&gt; — which policy matched&lt;/li&gt;
&lt;li&gt;&lt;code&gt;policy search: no policy match&lt;/code&gt; — the dreaded no-match&lt;/li&gt;
&lt;/ul&gt;
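&lt;p&gt;Trace files get long fast. A hypothetical &lt;code&gt;summarize&lt;/code&gt; helper — line formats assumed to match the sample above; adjust the patterns to your Junos version — pulls out just those key lines:&lt;/p&gt;

```python
import re

# Extract the interesting events from a flow trace (formats assumed
# to match the sample output above).
SAMPLE = """flow_process_pkt: received packet on ge-0/0/1.0, src 192.168.1.100, dst 8.8.8.8
  flow_first_policy_search: policy search from zone trust to zone untrust
  flow_first_policy_search: policy found: allow-web
  flow_first_install_session: session installed successfully"""

def summarize(trace):
    zones = re.search(r"from zone (\S+) to zone (\S+)", trace)
    policy = re.search(r"policy found: (\S+)", trace)
    return {
        "zone_pair": zones.groups() if zones else None,
        "policy": policy.group(1) if policy else "NO MATCH",
    }

print(summarize(SAMPLE))  # {'zone_pair': ('trust', 'untrust'), 'policy': 'allow-web'}
```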
&lt;h2&gt;Common Fix Patterns&lt;/h2&gt;
&lt;h3&gt;Pattern 1: &quot;Any&quot; Debugging&lt;/h3&gt;
&lt;p&gt;When nothing works, start with wide-open policy to confirm traffic flow:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Temporary debug policy (REMOVE AFTER!)
set security policies from-zone trust to-zone untrust policy DEBUG-ALLOW match source-address any
set security policies from-zone trust to-zone untrust policy DEBUG-ALLOW match destination-address any
set security policies from-zone trust to-zone untrust policy DEBUG-ALLOW match application any
set security policies from-zone trust to-zone untrust policy DEBUG-ALLOW then permit
insert security policies from-zone trust to-zone untrust policy DEBUG-ALLOW before policy &amp;lt;first-policy&amp;gt;
commit

# Traffic works? Problem is in policy specifics.
# Traffic fails? Problem is zones, routing, or NAT.
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Pattern 2: Explicit Logging&lt;/h3&gt;
&lt;p&gt;Enable session logs to see what&apos;s happening:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Log at session init and close
set security policies from-zone trust to-zone untrust policy allow-web then log session-init
set security policies from-zone trust to-zone untrust policy allow-web then log session-close
commit

# View logs
show log messages | match RT_FLOW
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Pattern 3: Policy Order Audit&lt;/h3&gt;
&lt;p&gt;Policies evaluate top-to-bottom. First match wins.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# See policy order
show security policies from-zone trust to-zone untrust

# Check if broad policy is shadowing specific one
# Policy &quot;deny-all&quot; at position 3 will never be reached if &quot;permit-any&quot; is at position 2
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Reorder if needed:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Move policy
insert security policies from-zone trust to-zone untrust policy specific-rule before policy broad-rule
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Pattern 4: Global Policy Check&lt;/h3&gt;
&lt;p&gt;Don&apos;t forget global policies — they apply across all zone pairs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;show security policies global

# Global policy might be permitting/denying before zone policy is evaluated
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;NAT Debugging&lt;/h2&gt;
&lt;h3&gt;Source NAT Not Working&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check NAT rule hit counts
show security nat source rule all

# Verify pool has addresses
show security nat source pool all

# Check if NAT is actually translating
show security flow session extensive | match &quot;NAT&quot;
# Look for: &quot;In: ... NAT: ... Out: ...&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Destination NAT Not Working&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check destination NAT rules
show security nat destination rule all

# Remember: policy must match POST-NAT destination!
# If NAT changes 203.0.113.10 → 10.0.1.100
# Policy destination-address must be 10.0.1.100
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Verification Commands Summary&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;# Zone verification
show security zones
show interfaces terse | match &quot;ge-&quot;

# Policy verification
show security policies from-zone X to-zone Y
show security policies hit-count
show security policies detail

# Session verification
show security flow session
show security flow session extensive
show security flow session count

# NAT verification
show security nat source rule all
show security nat destination rule all
show security nat source summary
show security nat destination summary

# Address book verification
show security address-book global
show security zones security-zone &amp;lt;zone&amp;gt; address-book

# Flow trace (use carefully)
show log flow-trace
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;SRX policy debugging follows a predictable pattern:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Verify zones first&lt;/strong&gt; — most issues are zone mismatches&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Check session table&lt;/strong&gt; — does traffic create a session?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Look at hit counters&lt;/strong&gt; — which policy is matching (or not)?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use flow trace last&lt;/strong&gt; — performance impact, use surgically&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The processing order matters:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Destination NAT → Routing → Policy → Source NAT&lt;/li&gt;
&lt;li&gt;Policy matches on post-DNAT, pre-SNAT addresses&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When stuck, simplify: broad &quot;any&quot; policy to confirm flow, then narrow down. Log everything. Read the trace output carefully.&lt;/p&gt;
&lt;p&gt;SRX is powerful but unforgiving. Understanding the flow means faster troubleshooting.&lt;/p&gt;
</content:encoded><category>juniper</category><category>firewall</category><category>security</category><category>troubleshooting</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>NAT Logging: Session Tracking for CGNAT and Compliance</title><link>https://ashimov.com/posts/vyos-nat-logging/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-nat-logging/</guid><description>Implement NAT session logging on VyOS. Covers connection tracking logs, log analysis, compliance requirements, and why NAT logs are essential for troubleshooting and legal requirements.</description><pubDate>Tue, 03 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;An abuse complaint arrives: &quot;IP 203.0.113.50 attacked our server at 14:32:15 UTC.&quot; But 203.0.113.50 is your NAT gateway, used by 500 subscribers. Who was it?&lt;/p&gt;
&lt;p&gt;Without NAT logging, you don&apos;t know. With NAT logging, you can trace 203.0.113.50:32768 at 14:32:15 back to internal IP 10.0.15.42 — subscriber John Smith. Now you can respond to the complaint.&lt;/p&gt;
&lt;p&gt;NAT logs are essential for troubleshooting and legal requirements.&lt;/p&gt;
&lt;h2&gt;Why Log NAT&lt;/h2&gt;
&lt;h3&gt;Abuse Handling&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Complaint: &quot;Attack from 203.0.113.50:32768 to victim:80 at 14:32:15&quot;

Without logs: &quot;We use NAT, could be anyone&quot;
With logs: &quot;That was 10.0.15.42 (subscriber #12345)&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Legal Requirements&lt;/h3&gt;
&lt;p&gt;Many jurisdictions require ISPs to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Log NAT translations&lt;/li&gt;
&lt;li&gt;Retain logs for specified period&lt;/li&gt;
&lt;li&gt;Provide data upon legal request&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Troubleshooting&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;User: &quot;My connection keeps dropping&quot;
NAT logs show: Frequent connection resets, port exhaustion
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Connection Tracking Logging&lt;/h2&gt;
&lt;h3&gt;Enable Conntrack Logging&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Log new connections
set system conntrack log new

# Log connection updates (optional, verbose)
set system conntrack log update

# Log connection destroy (useful for session duration)
set system conntrack log destroy

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Log Format&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Example log entries:
[NEW] tcp 6 120 SYN_SENT src=10.0.15.42 dst=93.184.216.34 sport=45678 dport=80 [UNREPLIED] src=93.184.216.34 dst=203.0.113.50 sport=80 dport=32768
[UPDATE] tcp 6 60 SYN_RECV src=10.0.15.42 dst=93.184.216.34 sport=45678 dport=80 src=93.184.216.34 dst=203.0.113.50 sport=80 dport=32768
[DESTROY] tcp 6 src=10.0.15.42 dst=93.184.216.34 sport=45678 dport=80 src=93.184.216.34 dst=203.0.113.50 sport=80 dport=32768
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Key Fields&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;[NEW] - Connection start
[DESTROY] - Connection end

src=10.0.15.42   - Original source (internal)
dst=93.184.216.34 - Original destination
sport=45678      - Original source port

src=93.184.216.34 - Reply source
dst=203.0.113.50  - Reply destination (your NAT IP)
dport=32768       - NAT&apos;d port (what abuse reports show)
&lt;/code&gt;&lt;/pre&gt;
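&lt;p&gt;These fields are regular enough to parse mechanically. A minimal, hypothetical &lt;code&gt;parse_conntrack&lt;/code&gt; helper — field positions assumed per the sample lines above — pulls out exactly the tuple an abuse response needs:&lt;/p&gt;

```python
import re

# Conntrack prints two tuples per line: original direction, then reply.
# The first src= is the internal host; the reply dst= is the NAT IP.
LINE = ("[NEW] tcp 6 120 SYN_SENT src=10.0.15.42 dst=93.184.216.34 "
        "sport=45678 dport=80 [UNREPLIED] src=93.184.216.34 "
        "dst=203.0.113.50 sport=80 dport=32768")

def parse_conntrack(line):
    pairs = re.findall(r"(src|dst|sport|dport)=(\S+)", line)
    orig, reply = dict(pairs[:4]), dict(pairs[4:8])
    return {
        "internal_ip": orig["src"],
        "nat_ip": reply["dst"],
        "nat_port": reply["dport"],  # the port an abuse report will cite
    }

print(parse_conntrack(LINE))
```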
&lt;h2&gt;Syslog Configuration&lt;/h2&gt;
&lt;h3&gt;Send to Remote Syslog&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Send logs to remote syslog server
set system syslog host 10.0.0.100 facility kern level debug

# Or specific file locally
set system syslog file nat-log facility kern level debug

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Log Rotation&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Configure logrotate for NAT logs
# /etc/logrotate.d/nat-logs

/var/log/messages {
    daily
    rotate 90    # Keep 90 days for compliance
    compress
    delaycompress
    notifempty
    create 640 root adm
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;NAT Session Table&lt;/h2&gt;
&lt;h3&gt;View Current Sessions&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show all NAT sessions
show nat source translations

# Or via conntrack
sudo conntrack -L

# Filter for specific internal IP
sudo conntrack -L -s 10.0.15.42

# Filter for specific NAT IP (reply destination after source NAT)
sudo conntrack -L -q 203.0.113.50
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Real-Time Monitoring&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Watch new connections
sudo conntrack -E -e NEW

# Watch specific source
sudo conntrack -E -s 10.0.15.42
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Log Analysis&lt;/h2&gt;
&lt;h3&gt;Find Session by External Port&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Complaint: 203.0.113.50:32768 at 14:32:15

# Search logs
grep &quot;dport=32768&quot; /var/log/messages | grep &quot;14:32&quot;

# Output:
# Jan 8 14:32:15 router kernel: [NEW] tcp ... src=10.0.15.42 ... dport=32768
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Find All Sessions for Internal IP&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Who was 10.0.15.42 talking to?
grep &quot;src=10.0.15.42&quot; /var/log/messages | grep &quot;\[NEW\]&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Session Statistics&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Connections per internal IP
# Take only the first src= per line; the second src= on each line is the
# reply tuple (remote server), not an internal host
grep &quot;\[NEW\]&quot; /var/log/messages | awk &apos;{for (i = 1; i &amp;lt;= NF; i++) if ($i ~ /^src=/) { print $i; next }}&apos; | sort | uniq -c | sort -rn | head

# Destinations contacted by one internal IP (first dst= per line)
grep &quot;src=10.0.15.42&quot; /var/log/messages | grep &quot;\[NEW\]&quot; | awk &apos;{for (i = 1; i &amp;lt;= NF; i++) if ($i ~ /^dst=/) { print $i; next }}&apos; | sort | uniq -c | sort -rn
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Log Storage and Retention&lt;/h2&gt;
&lt;h3&gt;Requirements&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;GDPR (EU): Not specifically defined, but purpose limitation applies
Various ISP regulations: 6 months to 2 years typical

Calculate storage:
- ~200 bytes per session
- 1000 sessions/second = 200 KB/s = 17 GB/day
- 90 days retention = 1.5 TB
&lt;/code&gt;&lt;/pre&gt;
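&lt;p&gt;The arithmetic above is easy to script so you can re-run it with your own session rate and retention period:&lt;/p&gt;

```shell
# Reproduce the sizing estimate: 200 bytes/session at 1000 sessions/s,
# 90-day retention (values from the text above).
BYTES_PER_SESSION=200
SESSIONS_PER_SEC=1000
RETENTION_DAYS=90

bytes_per_day=$((BYTES_PER_SESSION * SESSIONS_PER_SEC * 86400))
gb_per_day=$((bytes_per_day / 1000000000))
gb_total=$((bytes_per_day * RETENTION_DAYS / 1000000000))

# Roughly 17 GB/day and about 1.5 TB over 90 days
echo "${gb_per_day} GB/day, ${gb_total} GB for ${RETENTION_DAYS} days"
```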
&lt;h3&gt;Efficient Storage&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Use structured logging
# Compress old logs
# Consider log aggregation (ELK stack, etc.)

# Example: rsyslog to Elasticsearch
# /etc/rsyslog.d/nat-to-elastic.conf
module(load=&quot;omelasticsearch&quot;)

if $programname == &apos;kernel&apos; and $msg contains &apos;conntrack&apos; then {
    action(type=&quot;omelasticsearch&quot;
           server=&quot;elasticsearch.local&quot;
           serverport=&quot;9200&quot;
           template=&quot;nat-log-template&quot;)
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;CGNAT Specific Logging&lt;/h2&gt;
&lt;h3&gt;Extended Port Blocks&lt;/h3&gt;
&lt;p&gt;For CGNAT, logging individual connections is expensive. Alternative: log port block assignments.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Instead of logging every connection:
# [NEW] 10.0.15.42:12345 -&amp;gt; 93.184.216.34:80 via 203.0.113.50:32768

# Log port block assignment:
# 10.0.15.42 assigned ports 32768-34815 on 203.0.113.50 at 14:00:00
# 10.0.15.42 released ports 32768-34815 on 203.0.113.50 at 16:00:00

# Abuse lookup: 203.0.113.50:32768 at 14:32:15 was in 10.0.15.42&apos;s block
&lt;/code&gt;&lt;/pre&gt;
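&lt;p&gt;With fixed-size blocks, mapping a reported port back to a block is pure arithmetic. A sketch using the example values above (2048-port blocks starting at 32768); the block index then keys into the assignment log instead of per-connection logs:&lt;/p&gt;

```shell
# Map a reported NAT port to its port block with fixed-size blocks.
PORT=35000       # port from the abuse report
BASE=32768       # first port used for blocks
BLOCK_SIZE=2048  # ports per subscriber block

block=$(( (PORT - BASE) / BLOCK_SIZE ))
block_start=$(( BASE + block * BLOCK_SIZE ))
block_end=$(( block_start + BLOCK_SIZE - 1 ))

echo "port $PORT is in block $block (ports $block_start-$block_end)"
```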
&lt;h3&gt;VyOS CGNAT Logging&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# With deterministic NAT (fixed internal-to-port-block mapping),
# logging is simpler: the mapping is predictable, so only block
# assignments need recording.

# With dynamic NAT, full connection logging is required.
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Responding to Abuse Complaints&lt;/h2&gt;
&lt;h3&gt;Process&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;1. Receive complaint with:
   - External IP (your NAT)
   - External port
   - Timestamp (UTC!)
   - Destination IP/port
   - Protocol

2. Convert timestamp to your timezone

3. Search logs:
   grep &quot;dport=&amp;lt;port&amp;gt;&quot; /var/log/nat.log | grep &quot;&amp;lt;timestamp&amp;gt;&quot;

4. Identify internal IP

5. Map internal IP to subscriber (from DHCP logs, etc.)

6. Respond to complaint with internal reference
&lt;/code&gt;&lt;/pre&gt;
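&lt;p&gt;Steps 3 and 4 collapse into one pipeline. A self-contained sketch run against two hypothetical log lines (in production, point the pipeline at your real log file):&lt;/p&gt;

```shell
# Filter by the reported NAT port and minute, then extract the internal IP.
# log_lines stands in for the real log file; the lines are hypothetical.
log_lines() {
    printf 'Jan  8 14:31:02 router kernel: [NEW] tcp 6 src=10.0.15.7 dst=198.51.100.9 sport=40000 dport=443 src=198.51.100.9 dst=203.0.113.50 sport=443 dport=31000\n'
    printf 'Jan  8 14:32:15 router kernel: [NEW] tcp 6 src=10.0.15.42 dst=93.184.216.34 sport=45678 dport=80 src=93.184.216.34 dst=203.0.113.50 sport=80 dport=32768\n'
}

# Complaint: NAT port 32768 at 14:32 -- first src= on the matching line
# is the internal source to cross-reference with DHCP/subscriber records.
internal=$(log_lines | grep 'dport=32768' | grep '14:32' | grep -oE 'src=[0-9.]+' | head -1 | cut -d= -f2)
echo "complaint maps to internal IP: $internal"
```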
&lt;h3&gt;Response Template&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Reference: ABUSE-2025-0108-001
Complaint received: 2025-01-08

External IP: 203.0.113.50
External Port: 32768
Timestamp: 2025-01-08 14:32:15 UTC
Destination: victim.example.com:80

Investigation:
NAT logs show this connection originated from internal IP 10.0.15.42.
This corresponds to subscriber account #12345.

Action taken:
[Your action - warning, suspension, etc.]
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Performance Considerations&lt;/h2&gt;
&lt;h3&gt;Log Volume&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Logging every connection has cost:
# - CPU for logging
# - Disk I/O
# - Storage space

# For high-throughput NAT:
# Consider sampling
# Use dedicated log server
# Aggregate logs
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Log Only What&apos;s Needed&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Log only NEW events (not UPDATE/DESTROY)
set system conntrack log new

# DELETE logs useful for session duration, but doubles volume
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Security of Logs&lt;/h2&gt;
&lt;h3&gt;Access Control&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# NAT logs contain sensitive data
# - Who visited what
# - Privacy implications

# Restrict access
chmod 600 /var/log/nat.log
# Only authorized personnel

# Encrypt at rest
# Consider log integrity (signing)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Retention Policy&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Define retention period based on:
# - Legal requirements
# - Business needs
# - Privacy obligations

# Automatic purge after retention period
find /var/log/nat-archive -mtime +90 -delete
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Best Practices&lt;/h2&gt;
&lt;h3&gt;1. Log to Remote Server&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Don&apos;t rely on local storage
# Remote syslog or log aggregation
set system syslog host 10.0.0.100 facility kern level debug
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Timestamp Accuracy&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Use NTP for accurate timestamps
set system ntp server pool.ntp.org

# Complaints reference specific times
# Accuracy matters for correlation
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Test Log Search&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Before you need it:
# - Generate test traffic
# - Verify logs captured
# - Practice searching
# - Measure search time
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4. Document Process&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# NAT Log Lookup Procedure
1. Receive complaint
2. Verify timestamp timezone
3. Search command: grep &quot;dport=X&quot; /path/to/logs
4. Identify internal IP
5. Cross-reference with DHCP/subscriber database
6. Document findings
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;NAT logs are essential for troubleshooting and legal requirements.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Without NAT logs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Can&apos;t respond to abuse complaints&lt;/li&gt;
&lt;li&gt;Can&apos;t troubleshoot user issues&lt;/li&gt;
&lt;li&gt;May violate legal requirements&lt;/li&gt;
&lt;li&gt;&quot;It could be anyone&quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With NAT logs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Trace any connection to internal source&lt;/li&gt;
&lt;li&gt;Respond to abuse with evidence&lt;/li&gt;
&lt;li&gt;Meet compliance requirements&lt;/li&gt;
&lt;li&gt;Debug NAT issues&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The overhead of logging is real but necessary. Plan storage, plan retention, and practice lookup before you need it during an incident.&lt;/p&gt;
&lt;p&gt;Your first abuse complaint is too late to set up logging.&lt;/p&gt;
</content:encoded><category>vyos</category><category>firewall</category><category>monitoring</category><category>security</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>FlowSpec: Programmable Filters via BGP</title><link>https://ashimov.com/posts/vyos-flowspec/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-flowspec/</guid><description>Understand BGP FlowSpec for traffic filtering. Covers FlowSpec rules, BGP distribution, rate limiting, and why FlowSpec enables network-wide filtering from a single point.</description><pubDate>Fri, 27 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Blocking an attacking IP requires logging into every border router, adding firewall rules, and hoping you don&apos;t make a typo. For 10 routers, that&apos;s 10 changes. During an attack, when speed matters most.&lt;/p&gt;
&lt;p&gt;BGP FlowSpec distributes filter rules via BGP. Define a rule once, BGP propagates it to all routers. Routers automatically install the filters. Network-wide blocking from a single point.&lt;/p&gt;
&lt;p&gt;FlowSpec enables network-wide filtering from a single control point.&lt;/p&gt;
&lt;h2&gt;What FlowSpec Does&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;Traditional:
[Admin] → SSH → [Router 1] → add rule
        → SSH → [Router 2] → add rule
        → SSH → [Router 3] → add rule
        ... (manual, slow, error-prone)

FlowSpec:
[Admin] → [FlowSpec Controller] → BGP FlowSpec → [All Routers]
                                                  (automatic installation)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;FlowSpec Components&lt;/h2&gt;
&lt;h3&gt;NLRI (Network Layer Reachability Information)&lt;/h3&gt;
&lt;p&gt;FlowSpec rules are BGP NLRIs that describe traffic:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Destination&lt;/td&gt;
&lt;td&gt;Destination prefix&lt;/td&gt;
&lt;td&gt;203.0.113.0/24&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Source&lt;/td&gt;
&lt;td&gt;Source prefix&lt;/td&gt;
&lt;td&gt;198.51.100.0/24&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Protocol&lt;/td&gt;
&lt;td&gt;IP protocol&lt;/td&gt;
&lt;td&gt;TCP, UDP, ICMP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Port&lt;/td&gt;
&lt;td&gt;L4 port&lt;/td&gt;
&lt;td&gt;80, 443, 53&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fragment&lt;/td&gt;
&lt;td&gt;Fragmentation flags&lt;/td&gt;
&lt;td&gt;don&apos;t-fragment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Packet length&lt;/td&gt;
&lt;td&gt;Packet size&lt;/td&gt;
&lt;td&gt;100-1500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DSCP&lt;/td&gt;
&lt;td&gt;DiffServ code point&lt;/td&gt;
&lt;td&gt;46&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Actions&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Traffic-rate&lt;/td&gt;
&lt;td&gt;Rate limit (0 = drop)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Traffic-action&lt;/td&gt;
&lt;td&gt;Sample, terminal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Redirect&lt;/td&gt;
&lt;td&gt;Send to specific VRF&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Traffic-marking&lt;/td&gt;
&lt;td&gt;Set DSCP&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;FlowSpec on VyOS&lt;/h2&gt;
&lt;p&gt;VyOS FlowSpec support depends on version and FRRouting capabilities.&lt;/p&gt;
&lt;h3&gt;Enable FlowSpec Address Family&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Enable FlowSpec for IPv4
set protocols bgp address-family ipv4-flowspec

# Configure neighbor for FlowSpec
set protocols bgp neighbor 10.255.0.1 address-family ipv4-flowspec

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Receive FlowSpec Rules&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Accept FlowSpec from upstream
set protocols bgp neighbor 10.255.0.1 address-family ipv4-flowspec

# Interface to apply FlowSpec rules
set protocols bgp address-family ipv4-flowspec local-install interface eth0

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;FlowSpec Rule Examples&lt;/h2&gt;
&lt;h3&gt;Block Traffic to Destination&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Block all traffic to 203.0.113.100/32

# FlowSpec NLRI components:
# Destination: 203.0.113.100/32
# Action: traffic-rate 0 (drop)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Block Specific Port&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Block UDP port 53 to destination (DNS amplification)

# FlowSpec NLRI:
# Destination: 203.0.113.0/24
# Protocol: UDP (17)
# Destination port: 53
# Action: traffic-rate 0
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Rate Limit&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Rate limit ICMP to destination

# FlowSpec NLRI:
# Destination: 203.0.113.100/32
# Protocol: ICMP (1)
# Action: traffic-rate 1000000 (1 Mbps)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Block Source Network&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Block all traffic from attacking network

# FlowSpec NLRI:
# Source: 198.51.100.0/24
# Action: traffic-rate 0
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;FlowSpec Controller&lt;/h2&gt;
&lt;h3&gt;ExaBGP for FlowSpec&lt;/h3&gt;
&lt;p&gt;ExaBGP can inject FlowSpec rules:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# exabgp.conf
neighbor 10.255.0.1 {
    router-id 10.255.0.100;
    local-address 10.255.0.100;
    local-as 65001;
    peer-as 65001;

    flow {
        # Block UDP 53 to victim
        route destination 203.0.113.100/32
              protocol udp
              destination-port 53
              rate-limit 0;
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Inject Rule via API&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Using ExaBGP API
echo &quot;announce flow route destination 203.0.113.100/32 protocol tcp destination-port 80 rate-limit 0&quot; | socat - /var/run/exabgp.sock
&lt;/code&gt;&lt;/pre&gt;
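&lt;p&gt;A small wrapper makes injections less typo-prone during an attack. This sketch only builds the announce string, following the syntax of the example above; the exact API grammar and socket path depend on your ExaBGP version and setup:&lt;/p&gt;

```shell
# Build an ExaBGP FlowSpec drop command from parameters. Only string
# construction is shown; in production the result would be piped to the
# ExaBGP socket as in the example above.
flowspec_drop() {
    # $1 = destination prefix, $2 = protocol, $3 = destination port
    printf 'announce flow route destination %s protocol %s destination-port %s rate-limit 0\n' "$1" "$2" "$3"
}

cmd=$(flowspec_drop 203.0.113.100/32 udp 53)
echo "$cmd"
```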
&lt;h2&gt;Viewing FlowSpec&lt;/h2&gt;
&lt;h3&gt;Show Received Rules&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show FlowSpec routes
show bgp ipv4 flowspec

# Output:
# Flow     Destination         Protocol  Port    Action
# 1        203.0.113.100/32   UDP       53      rate-limit 0
# 2        203.0.113.0/24     TCP       80-443  rate-limit 1000000
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Show Installed Rules&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show rules installed on interface
show policy pbr flowspec interface eth0

# Or via iptables
sudo iptables -L -v -n | grep -i flowspec
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;FlowSpec Validation&lt;/h2&gt;
&lt;h3&gt;Important Security Measures&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Only accept FlowSpec from trusted sources
# Validate FlowSpec rules don&apos;t affect unintended traffic

# Prefix validation
# FlowSpec destination should match prefixes you announce
# Prevents upstream from filtering traffic you didn&apos;t request
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Validation Mode&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Enable FlowSpec validation
set protocols bgp address-family ipv4-flowspec validation

# Only accept FlowSpec for your own prefixes

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;FlowSpec Use Cases&lt;/h2&gt;
&lt;h3&gt;Use Case 1: DDoS Response&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Attack detected to 203.0.113.100
# Inject FlowSpec rule from controller

# Block all traffic (like RTBH but more granular)
# FlowSpec: destination 203.0.113.100/32, rate-limit 0

# Or block specific attack pattern
# FlowSpec: destination 203.0.113.100/32, protocol UDP, port 53, rate-limit 0
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Use Case 2: Traffic Scrubbing&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Redirect attack traffic to scrubbing center

# FlowSpec: destination 203.0.113.100/32, redirect VRF scrubbing

# Clean traffic sent back via normal routing
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Use Case 3: Rate Limiting&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Limit ICMP to all destinations (prevent ping flood amplification)

# FlowSpec: protocol ICMP, rate-limit 1000000
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Use Case 4: Network-Wide Policy&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Block entire protocol network-wide

# FlowSpec: protocol 47 (GRE), rate-limit 0
# All routers now block GRE
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;FlowSpec vs RTBH&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;RTBH&lt;/th&gt;
&lt;th&gt;FlowSpec&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Granularity&lt;/td&gt;
&lt;td&gt;/32 or /24&lt;/td&gt;
&lt;td&gt;5-tuple&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Actions&lt;/td&gt;
&lt;td&gt;Drop only&lt;/td&gt;
&lt;td&gt;Drop, rate-limit, redirect&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complexity&lt;/td&gt;
&lt;td&gt;Simple&lt;/td&gt;
&lt;td&gt;More complex&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;td&gt;Wide&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;pre&gt;&lt;code&gt;RTBH: Block everything to a destination
FlowSpec: Block specific traffic patterns to a destination
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Limitations&lt;/h2&gt;
&lt;h3&gt;Hardware Support&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;FlowSpec requires:
- BGP implementation supporting FlowSpec
- Data plane capable of implementing rules
- Sufficient TCAM/memory for rules

VyOS (software router):
- FlowSpec implemented via iptables/nftables
- Works but limited by CPU performance
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Rule Complexity&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;More rules = More processing
Complex rules = Harder to manage
Watch for:
- Too many concurrent rules
- Overlapping rules
- Stale rules (remove after attack)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Best Practices&lt;/h2&gt;
&lt;h3&gt;1. Start Simple&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Begin with basic destination blocks
# Add complexity as needed
# Test before production use
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Automate Rule Management&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Use FlowSpec controller
# Integrate with detection systems
# Automatic rule addition and removal
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Set Timeouts&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# FlowSpec rules should expire
# Don&apos;t leave blocking rules forever
# Implement automatic cleanup
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4. Document and Alert&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Log all FlowSpec changes
# Alert team when rules added
# Review rules regularly
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;FlowSpec enables network-wide filtering from a single control point.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Traditional filtering:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Manual changes on each router&lt;/li&gt;
&lt;li&gt;Slow during attacks&lt;/li&gt;
&lt;li&gt;Error-prone&lt;/li&gt;
&lt;li&gt;Inconsistent&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;FlowSpec:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Define once, propagate everywhere&lt;/li&gt;
&lt;li&gt;Fast deployment via BGP&lt;/li&gt;
&lt;li&gt;Consistent across network&lt;/li&gt;
&lt;li&gt;Can be automated&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;VyOS FlowSpec support varies by version. For production use:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Verify feature support&lt;/li&gt;
&lt;li&gt;Test thoroughly&lt;/li&gt;
&lt;li&gt;Have fallback (manual rules, RTBH)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;FlowSpec is powerful but complex. Start with simple rules, build automation, expand as you gain experience.&lt;/p&gt;
</content:encoded><category>vyos</category><category>bgp</category><category>networking</category><category>security</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>RTBH: Remote Triggered Blackhole Routing</title><link>https://ashimov.com/posts/vyos-rtbh/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-rtbh/</guid><description>Implement RTBH on VyOS for DDoS mitigation. Covers blackhole routing, BGP communities, triggering procedures, and why RTBH sacrifices the target to save the network.</description><pubDate>Tue, 24 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A massive DDoS attack is saturating your upstream link. Your entire network is affected because one target is receiving gigabits of attack traffic. You can&apos;t filter it — there&apos;s too much. You can&apos;t absorb it — your link is overwhelmed.&lt;/p&gt;
&lt;p&gt;RTBH (Remote Triggered Blackhole) tells your upstream provider: &quot;Drop all traffic to this IP.&quot; The attack traffic is discarded at the upstream, before it reaches your network. Your target is offline, but your network survives.&lt;/p&gt;
&lt;p&gt;RTBH sacrifices the target to save the network.&lt;/p&gt;
&lt;h2&gt;How RTBH Works&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;Normal:
Attack ─→ [ISP] ─→ [Your Router] ─→ [Victim Server]
                   │ Link saturated
                   └─→ [Other Servers] (collateral damage)

With RTBH:
Attack ─→ [ISP] ─✕─ (traffic blackholed)

         [Your Router] ─→ [Victim Server] (unreachable, but attack stopped)
                      └─→ [Other Servers] (working normally)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;RTBH Setup&lt;/h2&gt;
&lt;h3&gt;Prerequisites&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;BGP session with upstream provider&lt;/li&gt;
&lt;li&gt;Agreement on blackhole community (e.g., ISP_ASN:666)&lt;/li&gt;
&lt;li&gt;Prefix you can announce (/32 or /24 depending on ISP)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Configure Blackhole Route&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Create blackhole next-hop
set protocols static route 192.0.2.1/32 blackhole

# This creates a null route locally
# Packets to 192.0.2.1 are dropped

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Configure BGP to Announce Blackhole&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Define blackhole community (check with your ISP)
# This list matches the community on received routes; the route-map
# below attaches the community to announcements directly
set policy community-list ISP-BLACKHOLE rule 10 regex &quot;65000:666&quot;

# Route map for blackhole announcements
set policy route-map BLACKHOLE-OUT rule 10 action permit
set policy route-map BLACKHOLE-OUT rule 10 match ip address prefix-list BLACKHOLE-PREFIXES
set policy route-map BLACKHOLE-OUT rule 10 set community &quot;65000:666&quot;
set policy route-map BLACKHOLE-OUT rule 10 set origin igp

# Regular announcements
set policy route-map BLACKHOLE-OUT rule 20 action permit

# Apply to BGP neighbor
set protocols bgp neighbor 10.0.0.1 address-family ipv4-unicast route-map export BLACKHOLE-OUT

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Triggering RTBH&lt;/h2&gt;
&lt;h3&gt;Manual Trigger&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Add victim IP to blackhole prefix list
set policy prefix-list BLACKHOLE-PREFIXES rule 10 prefix 203.0.113.100/32
set policy prefix-list BLACKHOLE-PREFIXES rule 10 action permit

# Ensure route exists
set protocols static route 203.0.113.100/32 blackhole

commit

# ISP receives announcement with blackhole community
# ISP drops all traffic to 203.0.113.100
&lt;/code&gt;&lt;/pre&gt;
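&lt;p&gt;Generating these commands from a script lets the on-call engineer review them before committing. A sketch that only builds the command list (the rule number is a parameter; a real script should allocate an unused one):&lt;/p&gt;

```shell
# Emit the VyOS commands for a blackhole trigger, given target IP and
# prefix-list rule number.
gen_blackhole() {
    target=$1
    rule=$2
    printf 'set policy prefix-list BLACKHOLE-PREFIXES rule %s prefix %s/32\n' "$rule" "$target"
    printf 'set policy prefix-list BLACKHOLE-PREFIXES rule %s action permit\n' "$rule"
    printf 'set protocols static route %s/32 blackhole\n' "$target"
}

cmds=$(gen_blackhole 203.0.113.100 10)
echo "$cmds"
```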
&lt;h3&gt;Remove Blackhole&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Remove from prefix list
delete policy prefix-list BLACKHOLE-PREFIXES rule 10

# Remove blackhole route
delete protocols static route 203.0.113.100/32

commit

# ISP removes blackhole, traffic flows again
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Trigger Router Architecture&lt;/h2&gt;
&lt;h3&gt;Dedicated Trigger Router&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;                              ┌─────────────────┐
                              │ Trigger Router  │
                              │ (announces /32) │
                              └────────┬────────┘
                                       │ iBGP
┌────────────────────────────────────────────────────────────┐
│                    Your Network                            │
│  [Border1] ════════════════════════════════ [Border2]      │
│      │                                           │         │
│      └───────────────[ISP A]───────────────────┘           │
└────────────────────────────────────────────────────────────┘

Trigger router announces /32 with blackhole community
Border routers learn and propagate to ISP
ISP drops traffic to the /32
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Trigger Router Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Trigger router (separate from border routers)
set protocols bgp system-as 65001
set protocols bgp router-id 10.255.0.100

# iBGP to border routers
set protocols bgp neighbor 10.255.0.1 remote-as 65001
set protocols bgp neighbor 10.255.0.2 remote-as 65001

# Blackhole routes announced via iBGP
# Border routers then announce to ISP with community

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Destination-Based vs Source-Based RTBH&lt;/h2&gt;
&lt;h3&gt;Destination-Based (Common)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Drop traffic TO the victim
# Victim is unreachable but network saved

set protocols static route 203.0.113.100/32 blackhole
# Announce 203.0.113.100/32 with blackhole community
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Source-Based (If ISP Supports)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Drop traffic FROM attacker
# Victim remains reachable
# Requires ISP support for S-RTBH

# Much less common
# Check with your specific ISP
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Automation&lt;/h2&gt;
&lt;h3&gt;Trigger Script&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# /config/scripts/trigger-rtbh.sh
# Usage: trigger-rtbh.sh add|remove &amp;lt;target-ip&amp;gt;

ACTION=$1
TARGET=$2

if [ -z &quot;$ACTION&quot; ] || [ -z &quot;$TARGET&quot; ]; then
    echo &quot;Usage: $0 add|remove &amp;lt;target-ip&amp;gt;&quot;
    exit 1
fi

case $ACTION in
    add)
        # Add blackhole route (vtysh changes are not persisted in the
        # VyOS config; use the API for permanent changes)
        vtysh -c &quot;configure terminal&quot; \
              -c &quot;ip route $TARGET/32 blackhole&quot;

        # Add to prefix list
        # (requires API or direct config manipulation)
        echo &quot;Blackhole triggered for $TARGET&quot;
        ;;
    remove)
        vtysh -c &quot;configure terminal&quot; \
              -c &quot;no ip route $TARGET/32 blackhole&quot;
        echo &quot;Blackhole removed for $TARGET&quot;
        ;;
    *)
        echo &quot;Unknown action: $ACTION&quot;
        exit 1
        ;;
esac
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Integration with Monitoring&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# /config/scripts/auto-rtbh.sh

# Monitor traffic to critical IPs
# If threshold exceeded, trigger RTBH

THRESHOLD_PPS=1000000  # 1M pps

for ip in $(cat /config/protected-ips.txt); do
    PPS=$(get_pps_to_ip &quot;$ip&quot;)  # Your monitoring tool

    if [ &quot;$PPS&quot; -gt &quot;$THRESHOLD_PPS&quot; ]; then
        /config/scripts/trigger-rtbh.sh add &quot;$ip&quot;
        alert_team &quot;RTBH triggered for $ip (${PPS} pps)&quot;  # Your alerting hook
    fi
done
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;ISP Community Reference&lt;/h2&gt;
&lt;h3&gt;Common Blackhole Communities&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Format: ISP_ASN:666 (common convention)

# Check with your specific ISP
# Examples (verify before use):
# Level3:    3356:666 or 3356:9999
# NTT:       2914:666
# Cogent:    174:666
# Your ISP:  Check their BGP community documentation
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Multiple Upstreams&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Different community per upstream
set policy route-map BLACKHOLE-OUT-ISP1 rule 10 set community &quot;65001:666&quot;
set policy route-map BLACKHOLE-OUT-ISP2 rule 10 set community &quot;65002:666&quot;

# Apply to respective neighbors
set protocols bgp neighbor 10.0.0.1 address-family ipv4-unicast route-map export BLACKHOLE-OUT-ISP1
set protocols bgp neighbor 10.0.1.1 address-family ipv4-unicast route-map export BLACKHOLE-OUT-ISP2

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Verification&lt;/h2&gt;
&lt;h3&gt;Check Local Blackhole&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Verify blackhole route installed
show ip route 203.0.113.100

# Should show (static route, admin distance 1):
# S&amp;gt;* 203.0.113.100/32 [1/0] unreachable (blackhole), 00:05:00
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Check BGP Announcement&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Verify route is being announced
show bgp ipv4 unicast 203.0.113.100/32

# Check communities
show bgp ipv4 unicast 203.0.113.100/32 community

# Should show blackhole community attached
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Check with ISP&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Look at ISP&apos;s looking glass
# Verify they received the announcement
# Verify they&apos;re applying blackhole

# Most ISPs have looking glass tools
# Check route presence and community
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Risks and Considerations&lt;/h2&gt;
&lt;h3&gt;Victim Becomes Unreachable&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;RTBH drops ALL traffic to victim:
- Attack traffic: dropped ✓
- Legitimate traffic: dropped ✓

Victim is completely offline during RTBH
Only use when alternative is worse (entire network down)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Prefix Length Requirements&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Many ISPs only accept /24 or shorter
# /32 announcements may be filtered

# Options:
# 1. ISP accepts /32 with blackhole community (best)
# 2. Announce covering /24 (affects more IPs)
# 3. Use ISP-specific RTBH mechanism
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;BGP Propagation Time&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Trigger RTBH → BGP updates propagate → ISP applies blackhole

Typical: 30 seconds to a few minutes
During this window, the attack still reaches you
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Best Practices&lt;/h2&gt;
&lt;h3&gt;1. Document Procedure&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# RTBH Trigger Procedure

## When to Use
- Attack saturating upstream link
- Collateral damage to other services
- Manual filtering impossible

## Steps
1. Confirm attack target IP
2. Notify team/management
3. Execute trigger script
4. Verify with ISP
5. Monitor network recovery
6. Remove blackhole when attack stops

## Contacts
- ISP NOC: +1-xxx-xxx-xxxx
- Internal: @security-team
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Test Before You Need It&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Test with non-critical IP
# Verify ISP accepts and applies blackhole
# Measure propagation time
# Test removal procedure
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Have Rollback Ready&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Keep removal procedure ready
# Time-limit blackholes (auto-remove)
# Monitor for attack cessation
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4. Combine with Other Measures&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;RTBH: Nuclear option, saves network
Also use:
- Rate limiting (smaller attacks)
- Upstream scrubbing (sophisticated attacks)
- CDN/DDoS protection (application layer)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;RTBH sacrifices the target to save the network.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When to use RTBH:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Attack larger than your link can handle&lt;/li&gt;
&lt;li&gt;Collateral damage affecting entire network&lt;/li&gt;
&lt;li&gt;No other option available&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When NOT to use RTBH:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Attack is manageable with rate limiting&lt;/li&gt;
&lt;li&gt;Upstream scrubbing is available&lt;/li&gt;
&lt;li&gt;Target availability is critical&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;RTBH is the last resort. It works by making the victim unreachable to everyone — attackers and legitimate users alike. Use it when the alternative (entire network down) is worse.&lt;/p&gt;
&lt;p&gt;Have it configured and tested before you need it. During an attack is not the time to learn RTBH.&lt;/p&gt;
</content:encoded><category>vyos</category><category>bgp</category><category>security</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>DDoS Mitigation at the Edge: Rate Limiting and Traffic Scrubbing</title><link>https://ashimov.com/posts/vyos-ddos/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-ddos/</guid><description>Implement basic DDoS protection on VyOS edge routers. Covers rate limiting, connection limits, SYN flood protection, and why edge mitigation buys time for upstream solutions.</description><pubDate>Fri, 20 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Your upstream link is 1 Gbps. The attack is 10 Gbps. Your edge router can&apos;t help — the link is already saturated before packets reach you.&lt;/p&gt;
&lt;p&gt;Edge DDoS mitigation isn&apos;t about stopping massive volumetric attacks. It&apos;s about protecting against smaller attacks, reducing collateral damage, and buying time until upstream mitigation kicks in.&lt;/p&gt;
&lt;p&gt;Edge mitigation buys time. It&apos;s not a complete solution, but it&apos;s better than nothing.&lt;/p&gt;
&lt;h2&gt;What Edge Routers Can Do&lt;/h2&gt;
&lt;h3&gt;Effective Against&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Application-layer attacks (HTTP, DNS)&lt;/li&gt;
&lt;li&gt;SYN floods (up to link capacity)&lt;/li&gt;
&lt;li&gt;Slowloris-style attacks&lt;/li&gt;
&lt;li&gt;Amplification from your network&lt;/li&gt;
&lt;li&gt;Small-scale volumetric attacks&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Not Effective Against&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Attacks larger than your upstream link&lt;/li&gt;
&lt;li&gt;Sophisticated distributed attacks&lt;/li&gt;
&lt;li&gt;Attacks that saturate your ISP&apos;s network&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Rate Limiting&lt;/h2&gt;
&lt;h3&gt;Basic Rate Limiting with Firewall&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Rate limit incoming connections per source
set firewall ipv4 name WAN-IN rule 50 action drop
set firewall ipv4 name WAN-IN rule 50 recent count 100
set firewall ipv4 name WAN-IN rule 50 recent time minute
set firewall ipv4 name WAN-IN rule 50 state new
set firewall ipv4 name WAN-IN rule 50 description &quot;Rate limit: 100 new conn/min/source&quot;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Rate Limit Specific Services&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Rate limit SSH connections
set firewall ipv4 name WAN-LOCAL rule 100 action drop
set firewall ipv4 name WAN-LOCAL rule 100 destination port 22
set firewall ipv4 name WAN-LOCAL rule 100 protocol tcp
set firewall ipv4 name WAN-LOCAL rule 100 recent count 5
set firewall ipv4 name WAN-LOCAL rule 100 recent time minute
set firewall ipv4 name WAN-LOCAL rule 100 state new
set firewall ipv4 name WAN-LOCAL rule 100 description &quot;SSH: Max 5 new conn/min/source&quot;

# Allow SSH that passes rate limit
set firewall ipv4 name WAN-LOCAL rule 101 action accept
set firewall ipv4 name WAN-LOCAL rule 101 destination port 22
set firewall ipv4 name WAN-LOCAL rule 101 protocol tcp
set firewall ipv4 name WAN-LOCAL rule 101 state new

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Rate Limit DNS Queries&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Protect DNS server from amplification abuse
set firewall ipv4 name WAN-IN rule 200 action drop
set firewall ipv4 name WAN-IN rule 200 destination port 53
set firewall ipv4 name WAN-IN rule 200 protocol udp
set firewall ipv4 name WAN-IN rule 200 recent count 50
set firewall ipv4 name WAN-IN rule 200 recent time second
set firewall ipv4 name WAN-IN rule 200 description &quot;DNS: Max 50 queries/sec/source&quot;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Connection Limits&lt;/h2&gt;
&lt;h3&gt;Limit Concurrent Connections&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Limit connections per source IP
set firewall ipv4 name WAN-IN rule 60 action drop
set firewall ipv4 name WAN-IN rule 60 conntrack connection-limit source-mask 32
set firewall ipv4 name WAN-IN rule 60 conntrack connection-limit count 100
set firewall ipv4 name WAN-IN rule 60 state new
set firewall ipv4 name WAN-IN rule 60 description &quot;Max 100 concurrent connections/IP&quot;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Conntrack Table Protection&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Increase conntrack table size
set system conntrack table-size 524288

# Aggressive timeouts during attack
set system conntrack timeout tcp time-wait 30
set system conntrack timeout tcp close 10
set system conntrack timeout udp other 30

commit
&lt;/code&gt;&lt;/pre&gt;
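&lt;p&gt;To know whether the table size is adequate, compare current usage against the maximum. These proc entries are standard Linux:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Current vs maximum conntrack entries
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

# Sustained usage near the maximum means dropped connections
&lt;/code&gt;&lt;/pre&gt;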
&lt;h2&gt;SYN Flood Protection&lt;/h2&gt;
&lt;h3&gt;SYN Cookies&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Enable SYN cookies (usually enabled by default)
configure
set system sysctl parameter net.ipv4.tcp_syncookies value 1
commit

# SYN cookies let the kernel answer SYNs statelessly, so a flood cannot exhaust the listen backlog
&lt;/code&gt;&lt;/pre&gt;
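&lt;p&gt;Verification is straightforward with standard Linux tools:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Confirm SYN cookies are active (1 = enabled)
sysctl net.ipv4.tcp_syncookies

# Kernel counters hint at SYN pressure
netstat -s | grep -i syn
&lt;/code&gt;&lt;/pre&gt;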
&lt;h3&gt;SYN Rate Limiting&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Limit SYN packets per source
set firewall ipv4 name WAN-IN rule 70 action drop
set firewall ipv4 name WAN-IN rule 70 protocol tcp
set firewall ipv4 name WAN-IN rule 70 tcp flags syn
set firewall ipv4 name WAN-IN rule 70 recent count 20
set firewall ipv4 name WAN-IN rule 70 recent time second
set firewall ipv4 name WAN-IN rule 70 description &quot;SYN flood protection&quot;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Invalid Packet Dropping&lt;/h2&gt;
&lt;h3&gt;Drop Malformed Packets&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Drop invalid state packets
set firewall ipv4 name WAN-IN rule 1 action drop
set firewall ipv4 name WAN-IN rule 1 state invalid
set firewall ipv4 name WAN-IN rule 1 description &quot;Drop invalid packets&quot;

# Drop fragments (often used in attacks)
set firewall ipv4 name WAN-IN rule 2 action drop
set firewall ipv4 name WAN-IN rule 2 fragment match-frag
set firewall ipv4 name WAN-IN rule 2 description &quot;Drop fragments&quot;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;TCP Flag Validation&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Drop XMAS scan
set firewall ipv4 name WAN-IN rule 3 action drop
set firewall ipv4 name WAN-IN rule 3 protocol tcp
set firewall ipv4 name WAN-IN rule 3 tcp flags fin,psh,urg
set firewall ipv4 name WAN-IN rule 3 description &quot;Drop XMAS packets&quot;

# Drop NULL scan
set firewall ipv4 name WAN-IN rule 4 action drop
set firewall ipv4 name WAN-IN rule 4 protocol tcp
set firewall ipv4 name WAN-IN rule 4 tcp flags !syn,!ack,!fin,!rst,!urg,!psh
set firewall ipv4 name WAN-IN rule 4 description &quot;Drop NULL packets&quot;

commit
&lt;/code&gt;&lt;/pre&gt;
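&lt;p&gt;SYN+FIN is another combination that never appears in legitimate traffic. Assuming the same flag-match syntax, a rule like this (rule number 6 is an arbitrary free slot) drops it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Drop SYN+FIN (invalid combination)
set firewall ipv4 name WAN-IN rule 6 action drop
set firewall ipv4 name WAN-IN rule 6 protocol tcp
set firewall ipv4 name WAN-IN rule 6 tcp flags syn,fin
set firewall ipv4 name WAN-IN rule 6 description &quot;Drop SYN+FIN packets&quot;

commit
&lt;/code&gt;&lt;/pre&gt;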
&lt;h2&gt;Source Address Validation&lt;/h2&gt;
&lt;h3&gt;Block Bogons&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Block private and reserved (bogon) source ranges arriving on WAN
set firewall group network-group BOGONS network 10.0.0.0/8
set firewall group network-group BOGONS network 172.16.0.0/12
set firewall group network-group BOGONS network 192.168.0.0/16
set firewall group network-group BOGONS network 127.0.0.0/8
set firewall group network-group BOGONS network 0.0.0.0/8
set firewall group network-group BOGONS network 169.254.0.0/16
set firewall group network-group BOGONS network 100.64.0.0/10

set firewall ipv4 name WAN-IN rule 5 action drop
set firewall ipv4 name WAN-IN rule 5 source group network-group BOGONS
set firewall ipv4 name WAN-IN rule 5 description &quot;Block bogon sources&quot;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;uRPF (Unicast Reverse Path Forwarding)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Strict uRPF: drop if the best route back to the source is not via eth0
set firewall interface eth0 in ipv4 urpf strict

# Loose uRPF: drop only if no route to the source exists at all
# Configure one mode or the other, not both
set firewall interface eth0 in ipv4 urpf loose

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Traffic Shaping (QoS)&lt;/h2&gt;
&lt;h3&gt;Prioritize Legitimate Traffic&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Traffic policy
set traffic-policy shaper WAN-OUT bandwidth 1gbit
set traffic-policy shaper WAN-OUT default bandwidth 50%
set traffic-policy shaper WAN-OUT default ceiling 100%
set traffic-policy shaper WAN-OUT default queue-type fair-queue

# High priority class
set traffic-policy shaper WAN-OUT class 10 bandwidth 30%
set traffic-policy shaper WAN-OUT class 10 ceiling 100%
set traffic-policy shaper WAN-OUT class 10 match SSH ip destination port 22
set traffic-policy shaper WAN-OUT class 10 match ICMP ip protocol icmp

# Apply to interface
set interfaces ethernet eth0 traffic-policy out WAN-OUT

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Emergency Response&lt;/h2&gt;
&lt;h3&gt;Quick Blocks During Attack&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Block specific attacking IP immediately
configure
set firewall ipv4 name WAN-IN rule 10 action drop
set firewall ipv4 name WAN-IN rule 10 source address 203.0.113.100
commit

# Block attacking network
set firewall ipv4 name WAN-IN rule 11 action drop
set firewall ipv4 name WAN-IN rule 11 source address 203.0.113.0/24
commit
&lt;/code&gt;&lt;/pre&gt;
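&lt;p&gt;Adding one rule per attacker gets slow under pressure. An address group, created ahead of time and referenced by a single drop rule, reduces each block to one command (rule number 12 is an assumption; pick any free slot in your ruleset):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# One drop rule referencing a group (create before the attack)
set firewall group address-group BLOCKLIST address 203.0.113.100
set firewall ipv4 name WAN-IN rule 12 action drop
set firewall ipv4 name WAN-IN rule 12 source group address-group BLOCKLIST
commit

# Each later block is a single set plus commit
set firewall group address-group BLOCKLIST address 198.51.100.200
commit
&lt;/code&gt;&lt;/pre&gt;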
&lt;h3&gt;Block by Country (GeoIP)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# VyOS doesn&apos;t have native GeoIP
# Use external IP lists

# Download country IP ranges
# Add to firewall group
set firewall group network-group BLOCKED-COUNTRY network x.x.x.x/xx
# ... many entries

set firewall ipv4 name WAN-IN rule 20 action drop
set firewall ipv4 name WAN-IN rule 20 source group network-group BLOCKED-COUNTRY
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Monitoring During Attack&lt;/h2&gt;
&lt;h3&gt;Watch Connection States&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Monitor conntrack
watch -n 1 &apos;cat /proc/sys/net/netfilter/nf_conntrack_count&apos;

# Show connections by source
sudo conntrack -L | awk &apos;{print $5}&apos; | cut -d= -f2 | sort | uniq -c | sort -rn | head -20

# Show firewall counters (op commands need the wrapper under watch)
watch -n 1 &apos;/opt/vyatta/bin/vyatta-op-cmd-wrapper show firewall&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Traffic Analysis&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Monitor interface traffic (op commands need the wrapper under watch)
watch -n 1 &apos;/opt/vyatta/bin/vyatta-op-cmd-wrapper show interfaces ethernet eth0&apos;

# Capture attack traffic
sudo tcpdump -i eth0 -c 1000 -w /tmp/attack.pcap

# Quick packet rate estimate
timeout 10 tcpdump -i eth0 -c 10000 2&amp;gt;&amp;amp;1 | tail -1
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;What To Do When Overwhelmed&lt;/h2&gt;
&lt;h3&gt;1. Contact Upstream Provider&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Your ISP can:
# - Apply upstream ACLs
# - Activate DDoS scrubbing
# - Null route attacking traffic

# Have their NOC number ready!
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Enable Upstream Blackhole&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Advertise your prefix with blackhole community
# Traffic dropped at ISP, saves your link

# See RTBH article for details
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Use DDoS Protection Service&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Services like Cloudflare, Akamai, AWS Shield:
- Route traffic through their scrubbing centers
- They absorb attack, send clean traffic to you
- Works for attacks much larger than your capacity
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Best Practices&lt;/h2&gt;
&lt;h3&gt;1. Prepare Before Attack&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Have emergency playbook ready
# Know your upstream NOC contact
# Pre-configure blocking rules (disabled)
# Monitor baseline traffic patterns
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Layer Your Defense&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Layer 1: Upstream ISP (volumetric)
Layer 2: Edge router (smaller attacks)
Layer 3: Application firewall (app-layer)
Layer 4: Application hardening
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Automate Response&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/vbash
# /config/scripts/auto-block.sh
# Add sources holding more than THRESHOLD conntrack entries to a block group
# (BLOCKLIST must be referenced by a drop rule in WAN-IN)
source /opt/vyatta/etc/functions/script-template

THRESHOLD=1000  # connections per source

configure
for ip in $(sudo conntrack -L 2&amp;gt;/dev/null | awk &apos;{print $5}&apos; | cut -d= -f2 | sort | uniq -c | awk -v t=$THRESHOLD &apos;$1 &amp;gt; t {print $2}&apos;); do
    echo &quot;Blocking $ip&quot;
    set firewall group address-group BLOCKLIST address $ip
done
commit
exit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Edge mitigation buys time. It&apos;s not a complete solution.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;What edge routers can do:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Rate limit connections&lt;/li&gt;
&lt;li&gt;Drop invalid traffic&lt;/li&gt;
&lt;li&gt;Block known attackers&lt;/li&gt;
&lt;li&gt;Protect specific services&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What edge routers can&apos;t do:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Stop attacks larger than your pipe&lt;/li&gt;
&lt;li&gt;Replace upstream scrubbing&lt;/li&gt;
&lt;li&gt;Handle sophisticated multi-vector attacks&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Build defense in depth:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Upstream DDoS protection for volumetric&lt;/li&gt;
&lt;li&gt;Edge rate limiting for application-layer&lt;/li&gt;
&lt;li&gt;Application hardening for everything else&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The edge router is one layer. Make it effective, but don&apos;t rely on it alone.&lt;/p&gt;
</content:encoded><category>vyos</category><category>firewall</category><category>security</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Dynamic Routing Over Tunnels: BGP and OSPF Through Encrypted Links</title><link>https://ashimov.com/posts/vyos-tunnel-routing/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-tunnel-routing/</guid><description>Run routing protocols over VPN tunnels on VyOS. Covers OSPF over GRE/IPsec, BGP over WireGuard, tunnel interface selection, and why routing over tunnels requires careful planning.</description><pubDate>Tue, 17 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Static routes over VPN tunnels work until you have multiple tunnels, need failover, or manage complex topologies. Then you want routing protocols to handle the complexity.&lt;/p&gt;
&lt;p&gt;Running OSPF or BGP over tunnels adds resilience. If a tunnel goes down, the routing protocol detects it and converges to alternate paths. But tunnels add latency, may not support multicast, and need careful interface selection.&lt;/p&gt;
&lt;p&gt;Routing over tunnels requires careful planning — but it&apos;s worth it for resilient networks.&lt;/p&gt;
&lt;h2&gt;Tunnel Types and Routing Support&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tunnel Type&lt;/th&gt;
&lt;th&gt;OSPF&lt;/th&gt;
&lt;th&gt;BGP&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GRE&lt;/td&gt;
&lt;td&gt;Yes (multicast)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Full support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IPsec VTI&lt;/td&gt;
&lt;td&gt;Yes (unicast)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No multicast&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WireGuard&lt;/td&gt;
&lt;td&gt;Yes (unicast)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No multicast&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenVPN&lt;/td&gt;
&lt;td&gt;Yes (unicast)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No multicast&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Multicast vs Unicast OSPF&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# GRE supports multicast (normal OSPF)
set protocols ospf interface tun0 area 0

# IPsec VTI/WireGuard need unicast neighbors
set protocols ospf interface wg0 area 0
set protocols ospf neighbor 10.255.0.2  # Explicit neighbor
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;OSPF Over GRE&lt;/h2&gt;
&lt;h3&gt;GRE Tunnel Setup&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# GRE tunnel
set interfaces tunnel tun0 encapsulation gre
set interfaces tunnel tun0 source-address 203.0.113.1
set interfaces tunnel tun0 remote 198.51.100.1
set interfaces tunnel tun0 address 10.255.0.1/30
set interfaces tunnel tun0 mtu 1476

commit
&lt;/code&gt;&lt;/pre&gt;
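&lt;p&gt;Before layering OSPF on top, verify the tunnel itself forwards:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Tunnel interface up?
show interfaces tunnel tun0

# Remote tunnel address reachable?
ping 10.255.0.2
&lt;/code&gt;&lt;/pre&gt;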
&lt;h3&gt;OSPF Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# OSPF over GRE (multicast works)
set protocols ospf interface tun0 area 0
set protocols ospf interface tun0 network point-to-point
set protocols ospf interface tun0 hello-interval 10
set protocols ospf interface tun0 dead-interval 40

# Advertise tunnel network
set protocols ospf area 0 network 10.255.0.0/30

# Advertise local networks
set protocols ospf area 0 network 192.168.1.0/24

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Verify OSPF&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check neighbors
show ip ospf neighbor

# Should show neighbor via tun0
# Neighbor ID     Pri   State    Dead Time   Address      Interface
# 10.255.0.2      1     Full/-   00:00:32    10.255.0.2   tun0

# Check routes
show ip route ospf
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;OSPF Over IPsec (VTI)&lt;/h2&gt;
&lt;h3&gt;IPsec VTI Setup&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# IPsec VTI (route-based VPN)
set vpn ipsec interface eth0
set vpn ipsec esp-group ESP proposal 1 encryption aes256gcm128
set vpn ipsec esp-group ESP proposal 1 hash sha256
set vpn ipsec ike-group IKE proposal 1 encryption aes256
set vpn ipsec ike-group IKE proposal 1 hash sha256
set vpn ipsec ike-group IKE proposal 1 dh-group 14

set vpn ipsec site-to-site peer 198.51.100.1 authentication mode pre-shared-secret
set vpn ipsec site-to-site peer 198.51.100.1 authentication pre-shared-secret &quot;secret&quot;
set vpn ipsec site-to-site peer 198.51.100.1 connection-type initiate
set vpn ipsec site-to-site peer 198.51.100.1 ike-group IKE
set vpn ipsec site-to-site peer 198.51.100.1 local-address 203.0.113.1
set vpn ipsec site-to-site peer 198.51.100.1 vti bind vti0
set vpn ipsec site-to-site peer 198.51.100.1 vti esp-group ESP

# VTI interface
set interfaces vti vti0 address 10.255.0.1/30
set interfaces vti vti0 mtu 1400

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;OSPF Over VTI (No Multicast)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# OSPF needs explicit neighbor (no multicast over IPsec)
set protocols ospf interface vti0 area 0
set protocols ospf interface vti0 network point-to-point
set protocols ospf neighbor 10.255.0.2

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;OSPF Over WireGuard&lt;/h2&gt;
&lt;h3&gt;WireGuard Setup&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# WireGuard interface
set interfaces wireguard wg0 address 10.255.0.1/30
set interfaces wireguard wg0 port 51820
set interfaces wireguard wg0 private-key &amp;lt;your-private-key&amp;gt;

# Peer configuration
set interfaces wireguard wg0 peer PEER1 public-key &amp;lt;peer-public-key&amp;gt;
set interfaces wireguard wg0 peer PEER1 allowed-ips 0.0.0.0/0
set interfaces wireguard wg0 peer PEER1 endpoint 198.51.100.1:51820
set interfaces wireguard wg0 peer PEER1 persistent-keepalive 25

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;OSPF Over WireGuard&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# OSPF with explicit neighbor
set protocols ospf interface wg0 area 0
set protocols ospf interface wg0 network point-to-point
set protocols ospf neighbor 10.255.0.2

# BFD for faster failover
set protocols ospf interface wg0 bfd

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;BGP Over Tunnels&lt;/h2&gt;
&lt;p&gt;BGP is easier: it runs over unicast TCP, so it works over any tunnel type.&lt;/p&gt;
&lt;h3&gt;BGP Over WireGuard&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# WireGuard tunnel (as above)
# ...

# BGP over WireGuard
set protocols bgp system-as 65001
set protocols bgp neighbor 10.255.0.2 remote-as 65002
set protocols bgp neighbor 10.255.0.2 update-source wg0
set protocols bgp neighbor 10.255.0.2 address-family ipv4-unicast

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;BGP Over IPsec VTI&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# IPsec VTI (as above)
# ...

# BGP over VTI
set protocols bgp system-as 65001
set protocols bgp neighbor 10.255.0.2 remote-as 65002
set protocols bgp neighbor 10.255.0.2 update-source vti0
set protocols bgp neighbor 10.255.0.2 address-family ipv4-unicast

# BFD for fast failover
set protocols bgp neighbor 10.255.0.2 bfd

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Multi-Tunnel Design&lt;/h2&gt;
&lt;h3&gt;Hub and Spoke with OSPF&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;          [Spoke1]
              │ wg1
        ┌─────┴─────┐
        │           │
       [Hub]        │
        │           │
        └─────┬─────┘
              │ wg2
          [Spoke2]
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Hub configuration
configure

# WireGuard to Spoke1
set interfaces wireguard wg1 address 10.255.1.1/30
set interfaces wireguard wg1 peer SPOKE1 ...

# WireGuard to Spoke2
set interfaces wireguard wg2 address 10.255.2.1/30
set interfaces wireguard wg2 peer SPOKE2 ...

# OSPF on both tunnels
set protocols ospf interface wg1 area 0
set protocols ospf interface wg1 network point-to-point
set protocols ospf neighbor 10.255.1.2

set protocols ospf interface wg2 area 0
set protocols ospf interface wg2 network point-to-point
set protocols ospf neighbor 10.255.2.2

commit
&lt;/code&gt;&lt;/pre&gt;
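&lt;p&gt;Each spoke mirrors one leg of the hub. A sketch for Spoke1, with addresses taken from the diagram:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Spoke1 configuration
configure

# WireGuard back to the hub
set interfaces wireguard wg0 address 10.255.1.2/30
set interfaces wireguard wg0 peer HUB ...

# OSPF toward the hub
set protocols ospf interface wg0 area 0
set protocols ospf interface wg0 network point-to-point
set protocols ospf neighbor 10.255.1.1

commit
&lt;/code&gt;&lt;/pre&gt;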
&lt;h3&gt;Full Mesh with BGP&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Each site peers with all others via BGP
# More configuration but better path selection

set protocols bgp neighbor 10.255.0.2 remote-as 65002
set protocols bgp neighbor 10.255.0.3 remote-as 65003
set protocols bgp neighbor 10.255.0.4 remote-as 65004
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Fast Failover&lt;/h2&gt;
&lt;h3&gt;BFD Over Tunnels&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# BFD for fast tunnel failure detection
set protocols bfd peer 10.255.0.2 source address 10.255.0.1
set protocols bfd peer 10.255.0.2 interval transmit 300
set protocols bfd peer 10.255.0.2 interval receive 300
set protocols bfd peer 10.255.0.2 interval multiplier 3

# Link BFD to OSPF
set protocols ospf interface wg0 bfd

# Or link to BGP
set protocols bgp neighbor 10.255.0.2 bfd

commit
&lt;/code&gt;&lt;/pre&gt;
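&lt;p&gt;Once committed, check that the sessions actually came up:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# BFD session state (should show &quot;up&quot;)
show bfd peers

# A session stuck in &quot;down&quot; usually means the tunnel is broken
# or UDP 3784/3785 is filtered
&lt;/code&gt;&lt;/pre&gt;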
&lt;h3&gt;Tunnel Keepalives&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# WireGuard persistent keepalive
set interfaces wireguard wg0 peer PEER1 persistent-keepalive 25

# GRE keepalives
set interfaces tunnel tun0 parameters ip keepalive interval 10
set interfaces tunnel tun0 parameters ip keepalive failure-count 3
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Troubleshooting&lt;/h2&gt;
&lt;h3&gt;Routing Protocol Not Forming&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check tunnel is up
show interfaces wireguard
ping 10.255.0.2

# Check routing protocol
show ip ospf neighbor
show bgp summary

# Check firewall allows protocol traffic
# OSPF: Protocol 89
# BGP: TCP 179
# BFD: UDP 3784/3785
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Routes Not Propagating&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check route advertisement
show ip ospf database
show bgp ipv4 unicast

# Verify network statements
show configuration commands | grep &quot;protocols ospf area&quot;
show configuration commands | grep &quot;protocols bgp&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Asymmetric Routing&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Traffic goes out tunnel, returns via different path

# Ensure consistent costs
set protocols ospf interface wg0 cost 100
set protocols ospf interface wg1 cost 100

# Or use BGP with consistent metrics
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Best Practices&lt;/h2&gt;
&lt;h3&gt;1. Use Point-to-Point Network Type&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# For tunnel interfaces
set protocols ospf interface wg0 network point-to-point

# Saves DR election overhead
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Enable BFD&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Fast failure detection
set protocols ospf interface wg0 bfd
# Or
set protocols bgp neighbor 10.255.0.2 bfd
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Set Appropriate Costs&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Higher cost for slower/less reliable tunnels
set protocols ospf interface wg0 cost 100  # Fast tunnel
set protocols ospf interface wg1 cost 200  # Backup tunnel
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4. Consider MTU&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Ensure routing protocol packets fit
set interfaces wireguard wg0 mtu 1420

# Or enable fragmentation on underlay
&lt;/code&gt;&lt;/pre&gt;
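&lt;p&gt;To confirm the MTU actually fits, use a don&apos;t-fragment ping across the tunnel. With a 1420-byte interface MTU, the largest ICMP payload is 1420 minus 28 bytes of IP and ICMP headers:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# 1420 - 28 = 1392: largest DF ping that should succeed
ping -M do -s 1392 10.255.0.2

# One byte more should fail
ping -M do -s 1393 10.255.0.2
&lt;/code&gt;&lt;/pre&gt;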
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Routing over tunnels requires careful planning — but it&apos;s worth it for resilient networks.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Benefits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Automatic failover on tunnel failure&lt;/li&gt;
&lt;li&gt;Dynamic path selection&lt;/li&gt;
&lt;li&gt;Consistent with non-tunnel routing&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Challenges:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Tunnel type affects protocol support&lt;/li&gt;
&lt;li&gt;MTU requires attention&lt;/li&gt;
&lt;li&gt;Convergence time adds to tunnel detection time&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Key decisions:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Protocol&lt;/strong&gt;: OSPF (simple) or BGP (more control)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Failover&lt;/strong&gt;: BFD required for fast detection&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Topology&lt;/strong&gt;: Hub-spoke vs full mesh&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tunnel type&lt;/strong&gt;: Affects multicast support&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Done right, routing over tunnels gives you a resilient VPN mesh that handles failures automatically. Done wrong, you get mysterious routing issues and slow failover.&lt;/p&gt;
&lt;p&gt;Plan it. Test it. Monitor it.&lt;/p&gt;
</content:encoded><category>vyos</category><category>bgp</category><category>ospf</category><category>routing</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>VXLAN: Scalable L2 Over L3 Overlay</title><link>https://ashimov.com/posts/vyos-vxlan/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-vxlan/</guid><description>Configure VXLAN on VyOS for datacenter overlays. Covers VXLAN concepts, static and multicast modes, head-end replication, MTU, and why VXLAN enables scalable Layer 2 networks.</description><pubDate>Fri, 13 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;VLANs scale to 4094. That&apos;s not enough for large datacenters with thousands of tenants. VLAN tags are local to Layer 2 domains. Extending VLANs across L3 boundaries requires complex tricks.&lt;/p&gt;
&lt;p&gt;VXLAN (Virtual Extensible LAN) encapsulates Ethernet frames in UDP. 24-bit VNI supports 16 million segments. Runs over any IP network. Decouples overlay from underlay.&lt;/p&gt;
&lt;p&gt;VXLAN enables scalable Layer 2 networks over any IP infrastructure.&lt;/p&gt;
&lt;h2&gt;VXLAN Concepts&lt;/h2&gt;
&lt;h3&gt;How VXLAN Works&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;[Host A]──[VTEP1]═══ IP Network ═══[VTEP2]──[Host B]
           │                          │
    Encapsulate in UDP         Decapsulate
    (add VXLAN header)         (remove header)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;VXLAN Header&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Outer Ethernet │ Outer IP │ Outer UDP │ VXLAN │ Inner Ethernet │ Inner IP │ Payload
               │          │  dst 4789 │ VNI   │                │          │
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Key Terms&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;VNI&lt;/td&gt;
&lt;td&gt;VXLAN Network Identifier (24-bit, up to 16M)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VTEP&lt;/td&gt;
&lt;td&gt;VXLAN Tunnel Endpoint&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NVE&lt;/td&gt;
&lt;td&gt;Network Virtualization Edge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BUM&lt;/td&gt;
&lt;td&gt;Broadcast, Unknown unicast, Multicast&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;Basic VXLAN Configuration&lt;/h2&gt;
&lt;h3&gt;Static VXLAN (Point-to-Point)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Create VXLAN interface
set interfaces vxlan vxlan100 vni 100
set interfaces vxlan vxlan100 source-address 10.0.0.1
set interfaces vxlan vxlan100 remote 10.0.0.2
set interfaces vxlan vxlan100 port 4789

# Bridge VXLAN with local interface
set interfaces bridge br100 member interface vxlan100
set interfaces bridge br100 member interface eth1

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Remote Side&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Mirror configuration, swap source/remote
set interfaces vxlan vxlan100 vni 100
set interfaces vxlan vxlan100 source-address 10.0.0.2
set interfaces vxlan vxlan100 remote 10.0.0.1
set interfaces vxlan vxlan100 port 4789

set interfaces bridge br100 member interface vxlan100
set interfaces bridge br100 member interface eth1

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Multicast VXLAN&lt;/h2&gt;
&lt;p&gt;For multi-point VXLAN using multicast for BUM traffic:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# VXLAN with multicast group
set interfaces vxlan vxlan100 vni 100
set interfaces vxlan vxlan100 source-address 10.0.0.1
set interfaces vxlan vxlan100 group 239.1.1.100
set interfaces vxlan vxlan100 port 4789

# Bridge configuration
set interfaces bridge br100 member interface vxlan100
set interfaces bridge br100 member interface eth1

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Multicast Requirements&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Underlay must support multicast routing
# Enable PIM on underlay interfaces
set protocols pim interface eth0

# Or use static IGMP membership
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Head-End Replication (Unicast Mode)&lt;/h2&gt;
&lt;p&gt;No multicast required — VTEP replicates BUM to all remote VTEPs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# VXLAN with multiple remote VTEPs
set interfaces vxlan vxlan100 vni 100
set interfaces vxlan vxlan100 source-address 10.0.0.1
set interfaces vxlan vxlan100 remote 10.0.0.2
set interfaces vxlan vxlan100 remote 10.0.0.3
set interfaces vxlan vxlan100 remote 10.0.0.4
set interfaces vxlan vxlan100 port 4789

# BUM traffic is replicated to all remotes

commit
&lt;/code&gt;&lt;/pre&gt;
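&lt;p&gt;On Linux, the head-end replication list is visible in the forwarding database: each remote VTEP appears as an all-zero MAC entry used for flooding, so you can confirm the flood list directly:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# BUM flood destinations, one per configured remote
bridge fdb show dev vxlan100 | grep 00:00:00:00:00:00

# 00:00:00:00:00:00 dev vxlan100 dst 10.0.0.2 self permanent
# 00:00:00:00:00:00 dev vxlan100 dst 10.0.0.3 self permanent
# 00:00:00:00:00:00 dev vxlan100 dst 10.0.0.4 self permanent
&lt;/code&gt;&lt;/pre&gt;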
&lt;h3&gt;Scaling Consideration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Multicast: Efficient BUM delivery, requires multicast underlay
Unicast:   Simple, but BUM traffic multiplied by VTEP count

Small scale (few VTEPs): Unicast fine
Large scale: Multicast or EVPN control plane
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;VXLAN with EVPN&lt;/h2&gt;
&lt;p&gt;Best practice for production: EVPN control plane handles:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;MAC learning (no data plane flooding)&lt;/li&gt;
&lt;li&gt;Remote VTEP discovery (no manual configuration)&lt;/li&gt;
&lt;li&gt;BUM optimization&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;configure

# VXLAN interface
set interfaces vxlan vxlan100 vni 100
set interfaces vxlan vxlan100 source-address 10.0.0.1
set interfaces vxlan vxlan100 parameters nolearning

# nolearning: Disable data plane MAC learning (EVPN handles it)

# BGP EVPN configuration
set protocols bgp address-family l2vpn-evpn advertise-all-vni

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;MTU Considerations&lt;/h2&gt;
&lt;h3&gt;VXLAN Overhead&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Outer Ethernet:  14 bytes
Outer IP:        20 bytes
Outer UDP:        8 bytes
VXLAN header:     8 bytes
Total:           50 bytes

Standard 1500 MTU - 50 = 1450 inner MTU
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Configure MTU&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Option 1: Increase underlay MTU
set interfaces ethernet eth0 mtu 1550

# Option 2: Reduce overlay MTU
set interfaces bridge br100 mtu 1450

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Jumbo Frames&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Better option: Use jumbo frames on underlay
set interfaces ethernet eth0 mtu 9000

# VXLAN inner MTU: 9000 - 50 = 8950
# Standard 1500 MTU traffic fits easily
&lt;/code&gt;&lt;/pre&gt;
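&lt;p&gt;Verify the jumbo underlay end to end with a don&apos;t-fragment ping. The largest ICMP payload that fits a 9000-byte MTU is 9000 minus 28 bytes of headers:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# 9000 - 28 = 8972: should succeed between VTEPs
ping -M do -s 8972 10.0.0.2
&lt;/code&gt;&lt;/pre&gt;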
&lt;h2&gt;VXLAN Gateway&lt;/h2&gt;
&lt;h3&gt;L2 Gateway (Bridging Only)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# VXLAN bridges to local VLAN
set interfaces bridge br100 member interface vxlan100
set interfaces bridge br100 member interface eth1.100

# Local VLAN 100 traffic bridged to VXLAN 100
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;L3 Gateway (Routing)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Add IP to bridge for routing
set interfaces bridge br100 address 192.168.100.1/24

# VMs/hosts in VXLAN use this as gateway
# Router handles inter-VXLAN routing

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;VXLAN Routing Between VNIs&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Two VXLANs
set interfaces vxlan vxlan100 vni 100
set interfaces vxlan vxlan200 vni 200

# Two bridges with IPs
set interfaces bridge br100 member interface vxlan100
set interfaces bridge br100 address 192.168.100.1/24

set interfaces bridge br200 member interface vxlan200
set interfaces bridge br200 address 192.168.200.1/24

# Router routes between 192.168.100.0/24 and 192.168.200.0/24

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Viewing VXLAN State&lt;/h2&gt;
&lt;h3&gt;Check Interface&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show VXLAN interface
show interfaces vxlan

# Show VXLAN details
show interfaces vxlan vxlan100

# Show bridge MAC table
show bridge fdb interface br100
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Check Forwarding Database&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# View learned MACs
bridge fdb show dev vxlan100

# Output:
# aa:bb:cc:dd:ee:ff dev vxlan100 dst 10.0.0.2 self permanent
# 11:22:33:44:55:66 dev vxlan100 master br100
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Troubleshooting VXLAN&lt;/h2&gt;
&lt;h3&gt;Tunnel Not Working&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check VXLAN interface is up
show interfaces vxlan vxlan100

# Verify underlay connectivity
ping 10.0.0.2  # Remote VTEP

# Check UDP port 4789 is not filtered
nc -vzu 10.0.0.2 4789

# Capture VXLAN traffic
sudo tcpdump -i eth0 udp port 4789
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;No MAC Learning&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check bridge FDB
bridge fdb show dev vxlan100

# If empty, check:
# - ARP traffic flowing
# - VXLAN interface in bridge
# - nolearning not set (unless using EVPN)

# Generate traffic to trigger learning
arping -I br100 192.168.100.100
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;MTU Issues&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Symptoms: Small packets work, large fail

# Test with large ping
ping -s 1400 192.168.100.100

# If fails, check MTU
ip link show vxlan100
ip link show br100

# Verify underlay MTU is sufficient
ping -s 1500 -M do 10.0.0.2
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Security Considerations&lt;/h2&gt;
&lt;h3&gt;VXLAN Has No Encryption&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Traffic visible to underlay network:
- Inner Ethernet frames
- All payload content

For sensitive data:
- Encrypt at application layer
- Use IPsec on underlay
- Consider alternative (WireGuard overlay)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Firewall VXLAN Traffic&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Allow VXLAN only from known VTEPs
set firewall ipv4 name UNDERLAY-IN rule 100 action accept
set firewall ipv4 name UNDERLAY-IN rule 100 protocol udp
set firewall ipv4 name UNDERLAY-IN rule 100 destination port 4789
set firewall ipv4 name UNDERLAY-IN rule 100 source group network-group VTEPS
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;VXLAN Design Patterns&lt;/h2&gt;
&lt;h3&gt;Datacenter Fabric&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;         [Spine1]     [Spine2]
            │            │
    ────────┼────────────┼────────
    │       │       │    │       │
 [Leaf1] [Leaf2] [Leaf3] ...

Underlay: IP routing (OSPF/BGP) on spine-leaf
Overlay:  VXLAN between leaves
Control:  EVPN for MAC learning
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;DCI (Datacenter Interconnect)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;[DC1]══════VXLAN══════[DC2]
       over WAN

Extended L2 between datacenters
Watch out for: Latency, BUM flooding, split-brain
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;VXLAN enables scalable Layer 2 networks over any IP infrastructure.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;VXLAN advantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;16 million segments (vs 4K VLANs)&lt;/li&gt;
&lt;li&gt;Works over any IP network&lt;/li&gt;
&lt;li&gt;Decouples overlay from underlay&lt;/li&gt;
&lt;li&gt;Foundation for modern DC fabrics&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;VXLAN considerations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;50-byte overhead (needs MTU planning)&lt;/li&gt;
&lt;li&gt;BUM handling (multicast, unicast, or EVPN)&lt;/li&gt;
&lt;li&gt;No encryption (plan accordingly)&lt;/li&gt;
&lt;li&gt;Control plane important at scale&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For small deployments: Static VXLAN with head-end replication works.
For scale: EVPN control plane is the answer.&lt;/p&gt;
&lt;p&gt;VXLAN is infrastructure. The real magic is in the control plane (EVPN) and how you design the overlay-underlay interaction.&lt;/p&gt;
</content:encoded><category>vyos</category><category>networking</category><category>overlay</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>GRE, IPIP, and SIT Tunnels: Simple Point-to-Point Encapsulation</title><link>https://ashimov.com/posts/vyos-gre-tunnels/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-gre-tunnels/</guid><description>Configure GRE, IPIP, and SIT tunnels on VyOS. Covers tunnel types, MTU considerations, keepalives, GRE keys, and why simple tunnels solve simple problems.</description><pubDate>Tue, 10 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;VPNs like IPsec and WireGuard provide encryption. But sometimes you don&apos;t need encryption — just encapsulation. Connect two private networks over public internet without the complexity of key management.&lt;/p&gt;
&lt;p&gt;GRE, IPIP, and SIT are simple tunneling protocols. They wrap packets inside other packets. No encryption, minimal overhead, easy to set up. Use them when encapsulation is enough and encryption is handled elsewhere (or not needed).&lt;/p&gt;
&lt;p&gt;Simple tunnels solve simple problems.&lt;/p&gt;
&lt;h2&gt;Tunnel Types&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Full Name&lt;/th&gt;
&lt;th&gt;Encapsulates&lt;/th&gt;
&lt;th&gt;Overhead&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GRE&lt;/td&gt;
&lt;td&gt;Generic Routing Encapsulation&lt;/td&gt;
&lt;td&gt;Any protocol&lt;/td&gt;
&lt;td&gt;24 bytes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IPIP&lt;/td&gt;
&lt;td&gt;IP-in-IP&lt;/td&gt;
&lt;td&gt;IPv4 only&lt;/td&gt;
&lt;td&gt;20 bytes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SIT&lt;/td&gt;
&lt;td&gt;Simple Internet Transition&lt;/td&gt;
&lt;td&gt;IPv6 in IPv4&lt;/td&gt;
&lt;td&gt;20 bytes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;When to Use Each&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;GRE:   Most flexible, multicast support, routing protocols
IPIP:  Minimal overhead, IPv4 only
SIT:   IPv6 tunneling over IPv4
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;GRE Tunnel Configuration&lt;/h2&gt;
&lt;h3&gt;Basic GRE Tunnel&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Create GRE tunnel
set interfaces tunnel tun0 encapsulation gre
set interfaces tunnel tun0 source-address 203.0.113.1
set interfaces tunnel tun0 remote 198.51.100.1
set interfaces tunnel tun0 address 10.255.0.1/30

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Both Ends Must Match&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Site A (203.0.113.1)
set interfaces tunnel tun0 encapsulation gre
set interfaces tunnel tun0 source-address 203.0.113.1
set interfaces tunnel tun0 remote 198.51.100.1
set interfaces tunnel tun0 address 10.255.0.1/30

# Site B (198.51.100.1)
set interfaces tunnel tun0 encapsulation gre
set interfaces tunnel tun0 source-address 198.51.100.1
set interfaces tunnel tun0 remote 203.0.113.1
set interfaces tunnel tun0 address 10.255.0.2/30
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;GRE with Key&lt;/h3&gt;
&lt;p&gt;A GRE key identifies the tunnel, which is useful when running multiple tunnels to the same endpoint:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Add GRE key
set interfaces tunnel tun0 encapsulation gre
set interfaces tunnel tun0 source-address 203.0.113.1
set interfaces tunnel tun0 remote 198.51.100.1
set interfaces tunnel tun0 parameters ip key 12345
set interfaces tunnel tun0 address 10.255.0.1/30

commit

# Both ends must use same key
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;GRE Keepalives&lt;/h3&gt;
&lt;p&gt;Detect tunnel failure:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Enable keepalives
set interfaces tunnel tun0 parameters ip keepalive interval 10
set interfaces tunnel tun0 parameters ip keepalive failure-count 3

# Tunnel goes down after ~30 seconds (3 failures x 10 s interval)

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;IPIP Tunnel Configuration&lt;/h2&gt;
&lt;p&gt;Minimal overhead for IPv4-only:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Create IPIP tunnel
set interfaces tunnel tun0 encapsulation ipip
set interfaces tunnel tun0 source-address 203.0.113.1
set interfaces tunnel tun0 remote 198.51.100.1
set interfaces tunnel tun0 address 10.255.0.1/30

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;IPIP vs GRE&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;IPIP: 20 bytes overhead, IPv4 only, no multicast
GRE:  24 bytes overhead, any protocol, multicast support

Use IPIP when:
- Only IPv4 needed
- Minimal overhead matters
- No routing protocols over tunnel

Use GRE when:
- Need multicast (OSPF, etc.)
- Need IPv6 over tunnel
- Need GRE key for identification
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;SIT Tunnel Configuration&lt;/h2&gt;
&lt;p&gt;IPv6 over IPv4 tunneling:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Create SIT tunnel (6in4)
set interfaces tunnel tun0 encapsulation sit
set interfaces tunnel tun0 source-address 203.0.113.1
set interfaces tunnel tun0 remote 198.51.100.1
set interfaces tunnel tun0 address 2001:db8::1/64

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;6in4 Tunnel Example&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Site A: IPv4 203.0.113.1, wants IPv6 2001:db8:a::/48
# Site B: IPv4 198.51.100.1, wants IPv6 2001:db8:b::/48
# Tunnel addresses: 2001:db8:ffff::1/126 and ::2

# Site A
set interfaces tunnel tun0 encapsulation sit
set interfaces tunnel tun0 source-address 203.0.113.1
set interfaces tunnel tun0 remote 198.51.100.1
set interfaces tunnel tun0 address 2001:db8:ffff::1/126

set protocols static route6 2001:db8:b::/48 interface tun0

# Site B
set interfaces tunnel tun0 encapsulation sit
set interfaces tunnel tun0 source-address 198.51.100.1
set interfaces tunnel tun0 remote 203.0.113.1
set interfaces tunnel tun0 address 2001:db8:ffff::2/126

set protocols static route6 2001:db8:a::/48 interface tun0
&lt;/code&gt;&lt;/pre&gt;
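&lt;p&gt;Once both ends commit, verify the tunnel before trusting routing over it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Ping the far tunnel endpoint over IPv6
ping 2001:db8:ffff::2

# Confirm the static IPv6 route points at the tunnel
show ipv6 route 2001:db8:b::/48
&lt;/code&gt;&lt;/pre&gt;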
&lt;h2&gt;MTU Considerations&lt;/h2&gt;
&lt;h3&gt;Calculate Tunnel MTU&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Outer IP header:  20 bytes
GRE header:       4 bytes (+4 for key, +4 for sequence)
Inner packet:     MTU - overhead

Standard Ethernet (1500):
- GRE: 1500 - 24 = 1476 MTU
- IPIP: 1500 - 20 = 1480 MTU
- SIT: 1500 - 20 = 1480 MTU
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Set Tunnel MTU&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Set MTU on tunnel interface
set interfaces tunnel tun0 mtu 1476

# Important: Prevents fragmentation issues

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;MSS Clamping&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Clamp TCP MSS for traffic over the tunnel (VyOS 1.4+ syntax;
# older releases used &quot;set firewall options interface tun0 adjust-mss&quot;)
set interfaces tunnel tun0 ip adjust-mss 1436

# MSS = tunnel MTU - 40 (IP + TCP headers): 1476 - 40 = 1436
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Routing Over Tunnels&lt;/h2&gt;
&lt;h3&gt;Static Routes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Route remote network via tunnel
set protocols static route 10.2.0.0/16 interface tun0

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Dynamic Routing&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# OSPF over GRE (GRE supports multicast)
set protocols ospf interface tun0 area 0

# For IPIP (no multicast), use non-broadcast mode with explicit neighbors
set protocols ospf interface tun0 area 0
set protocols ospf interface tun0 network non-broadcast
set protocols ospf neighbor 10.255.0.2  # Explicit unicast neighbor

commit
&lt;/code&gt;&lt;/pre&gt;
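&lt;p&gt;Confirm the adjacency actually forms over the tunnel; a FULL neighbor state is the real test of multicast (or unicast-neighbor) reachability:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check OSPF neighbors over the tunnel
show ip ospf neighbor

# Verify routes learned via tun0
show ip route ospf
&lt;/code&gt;&lt;/pre&gt;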
&lt;h2&gt;GRE over IPsec&lt;/h2&gt;
&lt;p&gt;GRE for routing + IPsec for encryption:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# IPsec tunnel first
set vpn ipsec interface eth0
set vpn ipsec esp-group ESP-GRE proposal 1 encryption aes256
set vpn ipsec esp-group ESP-GRE proposal 1 hash sha256
set vpn ipsec ike-group IKE-GRE proposal 1 encryption aes256
set vpn ipsec ike-group IKE-GRE proposal 1 hash sha256

set vpn ipsec site-to-site peer 198.51.100.1 authentication mode pre-shared-secret
set vpn ipsec site-to-site peer 198.51.100.1 authentication pre-shared-secret &quot;secret&quot;
set vpn ipsec site-to-site peer 198.51.100.1 ike-group IKE-GRE
set vpn ipsec site-to-site peer 198.51.100.1 local-address 203.0.113.1
set vpn ipsec site-to-site peer 198.51.100.1 tunnel 1 esp-group ESP-GRE
set vpn ipsec site-to-site peer 198.51.100.1 tunnel 1 protocol gre

# GRE inside IPsec
set interfaces tunnel tun0 encapsulation gre
set interfaces tunnel tun0 source-address 203.0.113.1
set interfaces tunnel tun0 remote 198.51.100.1
set interfaces tunnel tun0 address 10.255.0.1/30

commit
&lt;/code&gt;&lt;/pre&gt;
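&lt;p&gt;Verify both layers independently: IPsec first, then GRE inside it.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# IPsec SA established?
show vpn ipsec sa

# GRE endpoint reachable through the encrypted path?
ping 10.255.0.2

# On the wire, only ESP should be visible (no cleartext GRE)
sudo tcpdump -i eth0 esp
&lt;/code&gt;&lt;/pre&gt;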
&lt;h2&gt;Troubleshooting Tunnels&lt;/h2&gt;
&lt;h3&gt;Tunnel Not Coming Up&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check interface status
show interfaces tunnel

# Verify source address is local
ip addr show | grep 203.0.113.1

# Check remote reachability (outer IP)
ping 198.51.100.1

# Check firewall allows GRE/IPIP
# GRE: Protocol 47
# IPIP: Protocol 4
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Traffic Not Flowing&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check routing
show ip route

# Verify routes via tunnel
show ip route 10.2.0.0/16

# Test tunnel connectivity
ping 10.255.0.2  # Tunnel endpoint

# Check MTU
ping -s 1400 -M do 10.255.0.2  # Large packet with DF
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Capture Tunnel Traffic&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Capture on physical interface (encapsulated)
sudo tcpdump -i eth0 ip proto gre
sudo tcpdump -i eth0 ip proto 4  # IPIP

# Capture on tunnel interface (inner packets)
sudo tcpdump -i tun0
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Security Considerations&lt;/h2&gt;
&lt;h3&gt;GRE/IPIP Have No Encryption&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Traffic visible to anyone on path:
- Inner IP addresses
- Protocol information
- Payload content

For sensitive data:
- Use GRE over IPsec
- Use WireGuard/IPsec instead
- Encrypt at application layer
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Firewall GRE at Ingress&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Only allow GRE from known peer
set firewall ipv4 name WAN-IN rule 100 action accept
set firewall ipv4 name WAN-IN rule 100 protocol gre
set firewall ipv4 name WAN-IN rule 100 source address 198.51.100.1

set firewall ipv4 name WAN-IN rule 999 action drop
&lt;/code&gt;&lt;/pre&gt;
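&lt;p&gt;A named chain does nothing until it is attached. A sketch in VyOS 1.4 firewall syntax (paths may vary by release), jumping to WAN-IN for router-destined traffic arriving on eth0:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Hook WAN-IN into the input path for eth0
set firewall ipv4 input filter rule 10 inbound-interface name eth0
set firewall ipv4 input filter rule 10 action jump
set firewall ipv4 input filter rule 10 jump-target WAN-IN
&lt;/code&gt;&lt;/pre&gt;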
&lt;h2&gt;Multiple Tunnels&lt;/h2&gt;
&lt;h3&gt;To Same Remote&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Use GRE key to distinguish
set interfaces tunnel tun0 encapsulation gre
set interfaces tunnel tun0 remote 198.51.100.1
set interfaces tunnel tun0 parameters ip key 1
set interfaces tunnel tun0 address 10.255.0.1/30

set interfaces tunnel tun1 encapsulation gre
set interfaces tunnel tun1 remote 198.51.100.1
set interfaces tunnel tun1 parameters ip key 2
set interfaces tunnel tun1 address 10.255.1.1/30
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;To Different Remotes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Different tunnel interfaces
set interfaces tunnel tun0 remote 198.51.100.1
set interfaces tunnel tun1 remote 198.51.100.2
set interfaces tunnel tun2 remote 198.51.100.3
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Simple tunnels solve simple problems.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Use GRE/IPIP/SIT when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Encapsulation is enough (no encryption needed)&lt;/li&gt;
&lt;li&gt;Running routing protocols over tunnel (GRE)&lt;/li&gt;
&lt;li&gt;IPv6 over IPv4 infrastructure (SIT)&lt;/li&gt;
&lt;li&gt;Minimal overhead matters (IPIP)&lt;/li&gt;
&lt;li&gt;Already have encryption elsewhere&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Don&apos;t use when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Data is sensitive (use IPsec/WireGuard)&lt;/li&gt;
&lt;li&gt;Through untrusted networks without encryption&lt;/li&gt;
&lt;li&gt;Need advanced features (VPN, multi-homing)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These protocols are old but not obsolete. They&apos;re tools in the toolkit. Know when to use them and when to reach for something more capable.&lt;/p&gt;
&lt;p&gt;Simple problems deserve simple solutions.&lt;/p&gt;
</content:encoded><category>vyos</category><category>networking</category><category>vpn</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>EVPN: Modern Control Plane for L2 and L3 Services</title><link>https://ashimov.com/posts/vyos-evpn/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-evpn/</guid><description>Understand EVPN architecture and concepts. Covers EVPN route types, MAC/IP learning via BGP, multi-homing, VXLAN integration, and why EVPN is the future of overlay networking.</description><pubDate>Fri, 06 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;VPLS floods unknown unicast. Every PE learns every MAC. Multi-homing is complicated. It works, but it&apos;s showing its age.&lt;/p&gt;
&lt;p&gt;EVPN (Ethernet VPN) fixes these problems. MAC addresses are distributed via BGP, not learned via data plane flooding. Multi-homing is first-class. Both L2 and L3 services use the same control plane. It works with MPLS or VXLAN underneath.&lt;/p&gt;
&lt;p&gt;EVPN is the future of overlay networking.&lt;/p&gt;
&lt;h2&gt;EVPN vs VPLS&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;VPLS&lt;/th&gt;
&lt;th&gt;EVPN&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MAC learning&lt;/td&gt;
&lt;td&gt;Data plane (flooding)&lt;/td&gt;
&lt;td&gt;Control plane (BGP)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unknown unicast&lt;/td&gt;
&lt;td&gt;Flood to all PEs&lt;/td&gt;
&lt;td&gt;Only to destination PE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-homing&lt;/td&gt;
&lt;td&gt;Complex (MC-LAG)&lt;/td&gt;
&lt;td&gt;Native (active-active)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IP routing&lt;/td&gt;
&lt;td&gt;Separate (L3VPN)&lt;/td&gt;
&lt;td&gt;Integrated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scalability&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Better&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;EVPN Concepts&lt;/h2&gt;
&lt;h3&gt;How EVPN Works&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;1. Host A connects to PE1
2. PE1 learns MAC-A locally
3. PE1 advertises MAC-A via BGP EVPN
4. All PEs install MAC-A → PE1
5. Traffic to MAC-A goes directly to PE1
   (no flooding!)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;EVPN Route Types&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Ethernet Auto-Discovery&lt;/td&gt;
&lt;td&gt;Multi-homing, aliasing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;MAC/IP Advertisement&lt;/td&gt;
&lt;td&gt;MAC and IP bindings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Inclusive Multicast&lt;/td&gt;
&lt;td&gt;BUM traffic handling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Ethernet Segment&lt;/td&gt;
&lt;td&gt;Multi-homing ESI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;IP Prefix&lt;/td&gt;
&lt;td&gt;L3 routing (EVPN Type-5)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Type 2: MAC/IP Route&lt;/h3&gt;
&lt;p&gt;The most common route type:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Route Distinguisher: 10.255.0.1:100
MAC Address: aa:bb:cc:dd:ee:ff
IP Address: 192.168.1.10 (optional)
Label: 1001
Next-hop: 10.255.0.1

&quot;MAC aa:bb:cc:dd:ee:ff (with IP 192.168.1.10) is behind PE 10.255.0.1&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Type 5: IP Prefix Route&lt;/h3&gt;
&lt;p&gt;For L3 routing in EVPN:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Route Distinguisher: 10.255.0.1:100
IP Prefix: 10.0.0.0/24
Gateway IP: 0.0.0.0 (optional)
Label: 2001

&quot;Route to 10.0.0.0/24 is behind PE 10.255.0.1&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;EVPN with VXLAN&lt;/h2&gt;
&lt;p&gt;Most common deployment: EVPN control plane + VXLAN data plane&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[Host A] ─ [Leaf1/VTEP] ═══ VXLAN ═══ [Leaf2/VTEP] ─ [Host B]
              │     IP underlay      │
           EVPN learns MAC/IP     EVPN learns MAC/IP
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;VyOS VXLAN with EVPN&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# VXLAN interface
set interfaces vxlan vxlan100 vni 100
set interfaces vxlan vxlan100 source-address 10.255.0.1
set interfaces vxlan vxlan100 parameters nolearning

# nolearning: Don&apos;t use data plane learning (EVPN handles it)

# Bridge
set interfaces bridge br100 member interface vxlan100
set interfaces bridge br100 member interface eth1

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;BGP EVPN Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# BGP with EVPN address family
set protocols bgp system-as 65000
set protocols bgp parameters router-id 10.255.0.1

# EVPN neighbor
set protocols bgp neighbor 10.255.0.2 remote-as 65000
set protocols bgp neighbor 10.255.0.2 address-family l2vpn-evpn

# Advertise all VNIs
set protocols bgp address-family l2vpn-evpn advertise-all-vni

commit
&lt;/code&gt;&lt;/pre&gt;
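&lt;p&gt;After commit, confirm the session is up and the EVPN address family was actually negotiated:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Session established with the l2vpn-evpn capability?
show bgp l2vpn evpn summary

# Local VNIs known to EVPN
show evpn vni
&lt;/code&gt;&lt;/pre&gt;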
&lt;h2&gt;EVPN Multi-Homing&lt;/h2&gt;
&lt;h3&gt;Ethernet Segment (ES)&lt;/h3&gt;
&lt;p&gt;Multiple PEs connected to same host/switch:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;                 [PE1] ─┐
[Server/Switch] ═══════┼═══ EVPN
                 [PE2] ─┘

ES (Ethernet Segment) = The dual-homed connection
ESI (ES Identifier) = Unique ID for the ES
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Active-Active Multi-Homing&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Define Ethernet Segment (both PEs)
set interfaces bonding bond0 member interface eth1
set interfaces bonding bond0 evpn ethernet-segment identifier 00:11:22:33:44:55:66:77:88:99

# ESI is a 10-octet value and must match on both PEs
# Both PEs actively forward
# EVPN handles aliasing and split horizon

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Benefits&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Active-active forwarding (both links used)&lt;/li&gt;
&lt;li&gt;Fast failover (no waiting for MAC learning)&lt;/li&gt;
&lt;li&gt;Loop prevention (DF election)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;EVPN Integrated Routing and Bridging (IRB)&lt;/h2&gt;
&lt;p&gt;Same EVPN instance provides L2 and L3:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Bridge for L2
set interfaces bridge br100 member interface vxlan100

# IRB interface for L3
set interfaces bridge br100 address 192.168.100.1/24

# Host in VXLAN 100 can:
# - L2 switch to other hosts in same VXLAN
# - L3 route to other networks via 192.168.100.1

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Symmetric vs Asymmetric IRB&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Asymmetric:
- Routing at ingress only
- Frame sent as L2 to egress PE
- Simpler but requires all VNIs everywhere

Symmetric:
- Routing at ingress and egress
- Uses L3 VNI for inter-subnet
- Better for large scale
&lt;/code&gt;&lt;/pre&gt;
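&lt;p&gt;Symmetric IRB needs a dedicated L3 VNI bound to the tenant VRF. A sketch, assuming VRF TENANT-A and L3 VNI 10100 (both names are placeholders; exact syntax varies by VyOS version):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# L3 VNI for inter-subnet routing within the tenant
set vrf name TENANT-A table 100
set vrf name TENANT-A vni 10100

# Place the IRB bridge interface into the VRF
set interfaces bridge br100 vrf TENANT-A

commit
&lt;/code&gt;&lt;/pre&gt;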
&lt;h2&gt;EVPN Route Targets&lt;/h2&gt;
&lt;p&gt;Similar to L3VPN:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# EVPN RT configuration
set vrf name TENANT-A protocols bgp address-family l2vpn-evpn rd 10.255.0.1:100
set vrf name TENANT-A protocols bgp address-family l2vpn-evpn route-target export 65000:100
set vrf name TENANT-A protocols bgp address-family l2vpn-evpn route-target import 65000:100

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Viewing EVPN State&lt;/h2&gt;
&lt;h3&gt;Show EVPN Routes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show all EVPN routes
show bgp l2vpn evpn

# Show specific route types
show bgp l2vpn evpn route type macip
show bgp l2vpn evpn route type multicast
show bgp l2vpn evpn route type prefix

# Show VNI information
show evpn vni
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Show MAC Table&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# EVPN-learned MACs
show evpn mac vni 100

# Output:
# VNI    MAC                Type    Interface / Remote VTEP
# 100    aa:bb:cc:dd:ee:ff  local   eth1
# 100    11:22:33:44:55:66  remote  10.255.0.2
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;EVPN Design Patterns&lt;/h2&gt;
&lt;h3&gt;Leaf-Spine with EVPN&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;           [Spine1]   [Spine2]
              │ ╲     ╱ │
              │  ╲   ╱  │
              │   ╲ ╱   │
              │   ╱ ╲   │
              │  ╱   ╲  │
           [Leaf1]   [Leaf2]
              │         │
           [Host A] [Host B]

Underlay: eBGP or an IGP provides VTEP reachability
Overlay:  EVPN leaf-to-leaf via spines (iBGP with spines as
          route reflectors, or eBGP sessions to each spine)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;EVPN-VXLAN Fabric&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Leaf configuration
set protocols bgp neighbor &amp;lt;spine1&amp;gt; remote-as &amp;lt;spine-as&amp;gt;
set protocols bgp neighbor &amp;lt;spine1&amp;gt; address-family l2vpn-evpn

# Spines reflect EVPN routes
# VTEPs on leafs

# Underlay provides IP connectivity
# EVPN provides MAC/IP learning
# VXLAN provides encapsulation
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Troubleshooting EVPN&lt;/h2&gt;
&lt;h3&gt;No EVPN Routes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check BGP session
show bgp l2vpn evpn summary

# Verify EVPN address family negotiated
show bgp neighbor &amp;lt;ip&amp;gt; | grep -i evpn

# Check local VNI
show evpn vni
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;MAC Not Advertised&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check local MAC learning
show bridge fdb

# Check EVPN advertisement
show bgp l2vpn evpn route type macip

# Verify VNI-to-VXLAN mapping
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Traffic Not Flowing&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Verify VXLAN tunnel
ping &amp;lt;remote-vtep&amp;gt;

# Check encapsulation
sudo tcpdump -i eth0 udp port 4789

# Verify MAC in remote VNI
show evpn mac vni 100 mac &amp;lt;mac-address&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;VyOS EVPN Status&lt;/h2&gt;
&lt;p&gt;VyOS EVPN support is evolving:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check current version capabilities
show version

# VyOS 1.4+ has improved EVPN support via FRRouting
# Test features in lab before production
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Migration from VPLS to EVPN&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;Phase 1: Deploy EVPN parallel to VPLS
Phase 2: Migrate traffic gradually
Phase 3: Decommission VPLS

Key difference:
- VPLS: Flood and learn
- EVPN: Advertise and install

Can run both during migration
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;EVPN is the future of overlay networking.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;EVPN advantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Control plane MAC learning (no flooding)&lt;/li&gt;
&lt;li&gt;Native multi-homing support&lt;/li&gt;
&lt;li&gt;Integrated L2 and L3&lt;/li&gt;
&lt;li&gt;Works with MPLS or VXLAN&lt;/li&gt;
&lt;li&gt;Scales better than VPLS&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When to use EVPN:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Data center fabrics&lt;/li&gt;
&lt;li&gt;Multi-tenant environments&lt;/li&gt;
&lt;li&gt;Stretched L2 domains&lt;/li&gt;
&lt;li&gt;Any new overlay deployment&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;VyOS EVPN support depends on version. For production, verify specific features work. For learning and smaller deployments, VyOS provides a good platform to understand EVPN concepts.&lt;/p&gt;
&lt;p&gt;The concepts here apply regardless of platform. EVPN is the direction the industry is moving — understanding it is essential for modern network engineering.&lt;/p&gt;
</content:encoded><category>vyos</category><category>bgp</category><category>networking</category><category>overlay</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>VPLS: Layer 2 VPN Over MPLS</title><link>https://ashimov.com/posts/vyos-vpls/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-vpls/</guid><description>Understand VPLS concepts and configuration. Covers virtual switch model, BGP signaling, pseudowires, MAC learning, and why VPLS provides multipoint L2 connectivity.</description><pubDate>Tue, 03 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;L3VPN provides routed connectivity — IP packets forwarded between sites. But sometimes you need Layer 2 — Ethernet frames bridged as if sites were on the same switch. Same broadcast domain, same VLAN, same ARP visibility.&lt;/p&gt;
&lt;p&gt;VPLS (Virtual Private LAN Service) provides exactly this. Multiple sites connected via MPLS backbone, appearing as a single Ethernet switch. Frames are encapsulated, labeled, and forwarded. MAC addresses are learned. Broadcast is flooded.&lt;/p&gt;
&lt;p&gt;VPLS provides any-to-any, multipoint Layer 2 connectivity over an MPLS backbone.&lt;/p&gt;
&lt;h2&gt;VPLS Concepts&lt;/h2&gt;
&lt;h3&gt;What VPLS Does&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Without VPLS:
[Site A] ─── separate switches ─── [Site B]
          │                      │
Different Layer 2 domains

With VPLS:
[Site A] ══════ VPLS ══════ [Site B]
          │                │
Same Layer 2 domain (virtual switch)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Components&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;PE (Provider Edge): Participates in VPLS, bridges customer frames
P (Provider): MPLS core, just label switching
CE (Customer Edge): Regular switch/router, no VPLS awareness
VSI (Virtual Switch Instance): Virtual switch on PE

[CE1] ─ [PE1] ═══════════════════ [PE2] ─ [CE2]
         │        MPLS            │
        VSI ←─ pseudowires ─→ VSI
         │     (label-switched)   │
        [PE3]
         │
        [CE3]
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;VPLS vs L3VPN&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;VPLS (L2VPN)&lt;/th&gt;
&lt;th&gt;L3VPN&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Forwarding&lt;/td&gt;
&lt;td&gt;Bridge (MAC)&lt;/td&gt;
&lt;td&gt;Route (IP)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Customer sees&lt;/td&gt;
&lt;td&gt;Same switch&lt;/td&gt;
&lt;td&gt;Router hop&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Protocol&lt;/td&gt;
&lt;td&gt;Ethernet&lt;/td&gt;
&lt;td&gt;IP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Broadcast&lt;/td&gt;
&lt;td&gt;Flooded&lt;/td&gt;
&lt;td&gt;Terminated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MAC learning&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;VPLS Signaling&lt;/h2&gt;
&lt;h3&gt;BGP-Based VPLS (RFC 4761)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Control plane: BGP
# PE routers exchange VPLS membership via BGP
# Auto-discovery of other PEs in same VPLS
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;LDP-Based VPLS (RFC 4762)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Control plane: Targeted LDP
# Pseudowires signaled via LDP
# Requires explicit configuration of remote PEs
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;VPLS on VyOS&lt;/h2&gt;
&lt;p&gt;VyOS VPLS support depends on version. Basic pseudowire configuration:&lt;/p&gt;
&lt;h3&gt;Pseudowire Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# pseudo-ethernet creates a MACVLAN-style sub-interface, not an
# MPLS pseudowire; it is useful only as an attachment circuit
set interfaces pseudo-ethernet peth0 link eth1
set interfaces pseudo-ethernet peth0 mode private

# Note: VyOS does not ship a complete VPLS implementation;
# for L2 extension over IP, use the L2TPv3 alternative

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;L2TPv3 for L2VPN (Alternative)&lt;/h3&gt;
&lt;p&gt;VyOS supports L2TPv3 which can provide similar L2 connectivity:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# L2TPv3 tunnel
set interfaces l2tpv3 l2tpeth0 source-address 10.0.0.1
set interfaces l2tpv3 l2tpeth0 remote 10.0.0.2
set interfaces l2tpv3 l2tpeth0 tunnel-id 100
set interfaces l2tpv3 l2tpeth0 peer-tunnel-id 100
set interfaces l2tpv3 l2tpeth0 session-id 100
set interfaces l2tpv3 l2tpeth0 peer-session-id 100
set interfaces l2tpv3 l2tpeth0 encapsulation udp
set interfaces l2tpv3 l2tpeth0 source-port 5000
set interfaces l2tpv3 l2tpeth0 destination-port 5000

# Bridge with local interface
set interfaces bridge br0 member interface l2tpeth0
set interfaces bridge br0 member interface eth1

commit
&lt;/code&gt;&lt;/pre&gt;
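&lt;p&gt;Verify the L2 path end to end: tunnel interface up, bridge forwarding, and MACs from the remote site appearing locally.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Tunnel interface and bridge status
show interfaces l2tpv3
show bridge

# MACs learned from the remote site should appear in the FDB
show bridge fdb
&lt;/code&gt;&lt;/pre&gt;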
&lt;h2&gt;VPLS Design Considerations&lt;/h2&gt;
&lt;h3&gt;Full Mesh Pseudowires&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;N PEs requires N×(N-1)/2 pseudowires

3 PEs: 3 pseudowires
5 PEs: 10 pseudowires
10 PEs: 45 pseudowires
20 PEs: 190 pseudowires

Scales poorly!
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Hierarchical VPLS (H-VPLS)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Solution: Hub-spoke at access layer

[CE]─[MTU]────[PE-aggregation]═══full-mesh═══[PE-aggregation]────[MTU]─[CE]

MTU = Multi-Tenant Unit (spoke)
PE = Hub, full mesh only between PEs
Reduces pseudowire count
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;MAC Address Learning&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Frame arrives at PE1 from CE1:
- PE1 learns: MAC-A is behind local interface
- PE1 floods to PE2, PE3 (VPLS peers)
- PE2 learns: MAC-A is behind PE1 (pseudowire)

Frame from CE2 to MAC-A:
- PE2 knows MAC-A → pseudowire to PE1
- PE1 knows MAC-A → local interface
- Frame delivered
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Unknown Unicast Flooding&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Unknown destination MAC:
- PE floods to all pseudowires
- Like a regular switch with unknown MAC
- All PEs receive, only one delivers

Broadcast/Multicast:
- Always flooded to all pseudowires
- Bandwidth consideration!
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;VPLS Challenges&lt;/h2&gt;
&lt;h3&gt;Split Horizon&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Frame received from pseudowire:
- Don&apos;t send back to pseudowires
- Only send to local interfaces

Prevents loops in full-mesh VPLS
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;MAC Table Scaling&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Every PE learns MACs from all sites:
- 10 sites × 1000 MACs = 10,000 MACs per PE
- Large deployments can exhaust MAC table

Solution:
- MAC address limits per VPLS
- MAC aging timers
- H-VPLS to contain scope
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Spanning Tree&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Customer running STP across VPLS:
- VPLS is loop-free (split horizon)
- STP BPDUs still traverse
- Can cause suboptimal paths

Options:
- Disable STP (VPLS handles loops)
- Tunnel STP (let customer manage)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;When to Use VPLS&lt;/h2&gt;
&lt;h3&gt;Good Use Cases&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Extending VLANs across WAN&lt;/li&gt;
&lt;li&gt;Data center interconnect (legacy)&lt;/li&gt;
&lt;li&gt;Customers requiring L2 adjacency&lt;/li&gt;
&lt;li&gt;Migration scenarios&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Not Ideal For&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;New greenfield deployments (use EVPN)&lt;/li&gt;
&lt;li&gt;Very large scale (MAC table limits)&lt;/li&gt;
&lt;li&gt;Multi-homing requirements (EVPN better)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Modern Alternative: EVPN&lt;/h2&gt;
&lt;p&gt;EVPN provides similar L2 connectivity with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Better MAC learning (BGP-based)&lt;/li&gt;
&lt;li&gt;Active-active multi-homing&lt;/li&gt;
&lt;li&gt;Better scalability&lt;/li&gt;
&lt;li&gt;Integrated L2 and L3&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;VPLS → EVPN migration is common trend
New deployments should consider EVPN first
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Troubleshooting VPLS&lt;/h2&gt;
&lt;h3&gt;Pseudowire Not Up&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check MPLS connectivity
show mpls ldp neighbor
ping &amp;lt;remote-PE-loopback&amp;gt;

# Check pseudowire status
# (command depends on implementation)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;MAC Not Learned&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check MAC table
show bridge fdb

# Verify VLAN tagging matches
show interfaces ethernet eth1

# Capture traffic
sudo tcpdump -i eth1 ether host &amp;lt;mac-address&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Flooding Issues&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Monitor pseudowire traffic
# Excessive flooding = possible MAC learning issue

# Check split horizon
# Packets from pseudowire shouldn&apos;t go back to pseudowires
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;VPLS provides multipoint Layer 2 connectivity over MPLS.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;VPLS enables:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Multiple sites on same Layer 2&lt;/li&gt;
&lt;li&gt;Transparent LAN extension&lt;/li&gt;
&lt;li&gt;Bridge instead of route&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Considerations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Pseudowire full-mesh scales poorly&lt;/li&gt;
&lt;li&gt;MAC learning at every PE&lt;/li&gt;
&lt;li&gt;Broadcast/unknown flooded everywhere&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;VyOS VPLS support is limited. For L2 extension:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Check current VyOS version features&lt;/li&gt;
&lt;li&gt;Consider L2TPv3 as alternative&lt;/li&gt;
&lt;li&gt;Evaluate EVPN for new deployments&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;VPLS served the industry well, but EVPN is the modern evolution. Understand VPLS concepts — they transfer to EVPN — but default to EVPN for new projects.&lt;/p&gt;
</content:encoded><category>vyos</category><category>mpls</category><category>networking</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>L3VPN: MPLS VPN for Multi-Site Connectivity</title><link>https://ashimov.com/posts/vyos-l3vpn/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-l3vpn/</guid><description>Configure MPLS L3VPN on VyOS. Covers VPNv4 address family, route distinguishers, route targets, PE-CE routing, and why L3VPN provides scalable multi-tenant connectivity.</description><pubDate>Fri, 30 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Each customer needs their own routing table. Their addresses might overlap with other customers. They need to reach their own sites but not others. Managing separate physical infrastructure doesn&apos;t scale.&lt;/p&gt;
&lt;p&gt;MPLS L3VPN (Layer 3 VPN) solves this. Each customer gets a Virtual Routing and Forwarding (VRF) instance. Customer routes are distinguished by Route Distinguisher. Route Targets control which VRFs import which routes. The MPLS backbone carries traffic with label stacks identifying VPN and destination.&lt;/p&gt;
&lt;p&gt;L3VPN provides scalable multi-tenant connectivity over shared infrastructure.&lt;/p&gt;
&lt;h2&gt;L3VPN Concepts&lt;/h2&gt;
&lt;h3&gt;Key Components&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;PE (Provider Edge): Has VRFs, connects to customers
P (Provider): Core router, just MPLS forwarding
CE (Customer Edge): Customer router, no VPN awareness

[CE1] ─── [PE1] ═══ MPLS ═══ [PE2] ─── [CE2]
  Site A           Backbone           Site B
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;VRF (Virtual Routing and Forwarding)&lt;/h3&gt;
&lt;p&gt;Separate routing table per customer:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;VRF CustomerA: 10.0.0.0/8 → Site A
VRF CustomerB: 10.0.0.0/8 → Site B  (same addresses, different VRF)
Global: Provider infrastructure only
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Route Distinguisher (RD)&lt;/h3&gt;
&lt;p&gt;Makes routes unique in BGP (not for forwarding):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Without RD: 10.0.0.0/8 from Customer A
            10.0.0.0/8 from Customer B  ← Collision!

With RD: 65000:1:10.0.0.0/8 from Customer A
         65000:2:10.0.0.0/8 from Customer B  ← Unique
&lt;/code&gt;&lt;/pre&gt;
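&lt;p&gt;The collision avoidance can be pictured as a table keyed on (RD, prefix). A minimal Python sketch with illustrative values:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# BGP keys VPNv4 routes on (RD, prefix), so identical customer
# prefixes coexist as distinct entries.
vpnv4 = {}
vpnv4[('65000:1', '10.0.0.0/8')] = 'Customer A via PE1'
vpnv4[('65000:2', '10.0.0.0/8')] = 'Customer B via PE1'

print(len(vpnv4))  # two entries, no collision
&lt;/code&gt;&lt;/pre&gt;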
&lt;h3&gt;Route Target (RT)&lt;/h3&gt;
&lt;p&gt;Controls route import/export between VRFs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Export RT: &quot;Tag this route for customers who want it&quot;
Import RT: &quot;Import routes with this tag&quot;

CustomerA-VRF exports: 65000:100
CustomerA-VRF imports: 65000:100  (import own routes at other sites)
&lt;/code&gt;&lt;/pre&gt;
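&lt;p&gt;Import filtering reduces to a set-intersection test on Route Targets. A minimal Python sketch of the matching logic — the prefixes and RT values are illustrative:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;routes = [
    {'prefix': '10.1.0.0/16', 'rts': {'65000:100'}},       # Customer A
    {'prefix': '172.16.0.0/16', 'rts': {'65000:200'}},     # Customer B
    {'prefix': '192.168.100.0/24', 'rts': {'65000:999'}},  # shared services
]

def vrf_import(import_rts, all_routes):
    '''A VRF imports routes whose RTs intersect its import list.'''
    return [r['prefix'] for r in all_routes if r['rts'].intersection(import_rts)]

# Customer A imports its own RT plus the shared-services RT:
print(vrf_import({'65000:100', '65000:999'}, routes))
&lt;/code&gt;&lt;/pre&gt;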
&lt;h2&gt;Basic L3VPN Configuration&lt;/h2&gt;
&lt;h3&gt;PE Router Setup&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Create VRF for customer
set vrf name CUSTOMER-A table 100
set vrf name CUSTOMER-A description &quot;Customer A VPN&quot;

# Route Distinguisher (unique per VRF)
set vrf name CUSTOMER-A protocols bgp address-family ipv4-unicast rd 65000:1

# Route Targets
set vrf name CUSTOMER-A protocols bgp address-family ipv4-unicast route-target export 65000:100
set vrf name CUSTOMER-A protocols bgp address-family ipv4-unicast route-target import 65000:100

# Assign interface to VRF
set interfaces ethernet eth1 vrf CUSTOMER-A
set interfaces ethernet eth1 address 192.168.1.1/24

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Enable VPNv4 Address Family&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# BGP configuration
set protocols bgp system-as 65000
set protocols bgp parameters router-id 10.255.0.1

# VPNv4 address family with PE peers
set protocols bgp neighbor 10.255.0.2 remote-as 65000
set protocols bgp neighbor 10.255.0.2 update-source lo
set protocols bgp neighbor 10.255.0.2 address-family ipv4-vpn

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Redistribute Customer Routes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# In the VRF context
set vrf name CUSTOMER-A protocols bgp system-as 65000
set vrf name CUSTOMER-A protocols bgp address-family ipv4-unicast redistribute connected

# Or with static routes from CE
set vrf name CUSTOMER-A protocols static route 10.1.0.0/16 next-hop 192.168.1.2

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;PE-CE Routing&lt;/h2&gt;
&lt;h3&gt;Static Routing&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Static routes from CE
set vrf name CUSTOMER-A protocols static route 10.1.0.0/16 next-hop 192.168.1.2
set vrf name CUSTOMER-A protocols static route 10.2.0.0/16 next-hop 192.168.1.2

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;eBGP PE-CE&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# BGP session with CE router
set vrf name CUSTOMER-A protocols bgp neighbor 192.168.1.2 remote-as 65100
set vrf name CUSTOMER-A protocols bgp neighbor 192.168.1.2 address-family ipv4-unicast

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;OSPF PE-CE&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# OSPF with CE
set vrf name CUSTOMER-A protocols ospf interface eth1
set vrf name CUSTOMER-A protocols ospf redistribute bgp

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Complete L3VPN Example&lt;/h2&gt;
&lt;h3&gt;Topology&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Customer A Site 1          Provider Backbone          Customer A Site 2
[CE1] ─── [PE1] ═══════════════════════════════ [PE2] ─── [CE2]
10.1.0.0/16    VRF:CUST-A                         VRF:CUST-A    10.2.0.0/16
               RD 65000:1                         RD 65000:1
               RT 65000:100                       RT 65000:100
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;PE1 Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Loopback for BGP
set interfaces loopback lo address 10.255.0.1/32

# VRF
set vrf name CUST-A table 100
set vrf name CUST-A protocols bgp address-family ipv4-unicast rd 65000:1
set vrf name CUST-A protocols bgp address-family ipv4-unicast route-target export 65000:100
set vrf name CUST-A protocols bgp address-family ipv4-unicast route-target import 65000:100

# Customer interface
set interfaces ethernet eth1 vrf CUST-A
set interfaces ethernet eth1 address 192.168.1.1/24

# Core interface
set interfaces ethernet eth0 address 10.0.0.1/30

# MPLS LDP
set protocols mpls ldp router-id 10.255.0.1
set protocols mpls ldp interface eth0

# OSPF for backbone
set protocols ospf area 0 network 10.255.0.1/32
set protocols ospf area 0 network 10.0.0.0/30

# BGP
set protocols bgp system-as 65000
set protocols bgp parameters router-id 10.255.0.1
set protocols bgp neighbor 10.255.0.2 remote-as 65000
set protocols bgp neighbor 10.255.0.2 update-source lo
set protocols bgp neighbor 10.255.0.2 address-family ipv4-vpn

# VRF BGP
set vrf name CUST-A protocols bgp system-as 65000
set vrf name CUST-A protocols bgp neighbor 192.168.1.2 remote-as 65100
set vrf name CUST-A protocols bgp neighbor 192.168.1.2 address-family ipv4-unicast

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;PE2 Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Loopback
set interfaces loopback lo address 10.255.0.2/32

# VRF (same RT for same customer)
set vrf name CUST-A table 100
set vrf name CUST-A protocols bgp address-family ipv4-unicast rd 65000:1
set vrf name CUST-A protocols bgp address-family ipv4-unicast route-target export 65000:100
set vrf name CUST-A protocols bgp address-family ipv4-unicast route-target import 65000:100

# Customer interface
set interfaces ethernet eth1 vrf CUST-A
set interfaces ethernet eth1 address 192.168.2.1/24

# Core interface
set interfaces ethernet eth0 address 10.0.0.2/30

# MPLS LDP
set protocols mpls ldp router-id 10.255.0.2
set protocols mpls ldp interface eth0

# OSPF
set protocols ospf area 0 network 10.255.0.2/32
set protocols ospf area 0 network 10.0.0.0/30

# BGP
set protocols bgp system-as 65000
set protocols bgp parameters router-id 10.255.0.2
set protocols bgp neighbor 10.255.0.1 remote-as 65000
set protocols bgp neighbor 10.255.0.1 update-source lo
set protocols bgp neighbor 10.255.0.1 address-family ipv4-vpn

# VRF BGP
set vrf name CUST-A protocols bgp system-as 65000
set vrf name CUST-A protocols bgp neighbor 192.168.2.2 remote-as 65100
set vrf name CUST-A protocols bgp neighbor 192.168.2.2 address-family ipv4-unicast

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Viewing L3VPN State&lt;/h2&gt;
&lt;h3&gt;Show VPN Routes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show VPNv4 routes
show bgp ipv4 vpn

# Output:
# Route Distinguisher: 65000:1
# *&amp;gt;  10.1.0.0/16      192.168.1.2     0   0    65100 i
# *&amp;gt;  10.2.0.0/16      10.255.0.2      0   0    65100 i

# Show specific VRF routes
show ip route vrf CUST-A
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Show Labels&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show VPN labels
show bgp ipv4 vpn labels

# Show MPLS forwarding table
show mpls table
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Verify Connectivity&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Ping within VRF
ping 10.2.0.1 vrf CUST-A

# Traceroute within VRF
traceroute 10.2.0.1 vrf CUST-A
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Route Target Patterns&lt;/h2&gt;
&lt;h3&gt;Hub-and-Spoke&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Hub imports all, spokes import only from hub
# Hub VRF:
set vrf name HUB protocols bgp address-family ipv4-unicast route-target export 65000:1
set vrf name HUB protocols bgp address-family ipv4-unicast route-target import 65000:2

# Spoke VRF:
set vrf name SPOKE1 protocols bgp address-family ipv4-unicast route-target export 65000:2
set vrf name SPOKE1 protocols bgp address-family ipv4-unicast route-target import 65000:1

# Traffic: Spoke → Hub → Spoke (forced through hub)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Full Mesh&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# All sites import from all sites
set vrf name SITE protocols bgp address-family ipv4-unicast route-target export 65000:100
set vrf name SITE protocols bgp address-family ipv4-unicast route-target import 65000:100

# Any-to-any connectivity
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Extranet&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Customer A can reach shared services
# Customer A VRF (import the shared-services RT):
set vrf name CUST-A protocols bgp address-family ipv4-unicast route-target import 65000:999

# Shared Services VRF:
set vrf name SHARED protocols bgp address-family ipv4-unicast route-target export 65000:999
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Troubleshooting L3VPN&lt;/h2&gt;
&lt;h3&gt;Routes Not Exchanged&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check VPNv4 session
show bgp ipv4 vpn summary

# Check RT configuration
show vrf CUST-A

# Verify import/export RT match between sites
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;MPLS Labels Not Working&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check MPLS is enabled on core interfaces
show interfaces ethernet eth0

# Check LDP neighbor
show mpls ldp neighbor

# Check MPLS table
show mpls table
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Traffic Not Flowing&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Verify VRF routing table
show ip route vrf CUST-A

# Check label stack
show bgp ipv4 vpn 10.2.0.0/16

# Trace path
traceroute 10.2.0.1 vrf CUST-A
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;L3VPN provides scalable multi-tenant connectivity over shared infrastructure.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Without L3VPN:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Separate physical networks per customer&lt;/li&gt;
&lt;li&gt;Address overlap impossible&lt;/li&gt;
&lt;li&gt;Doesn&apos;t scale&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With L3VPN:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Single MPLS backbone serves all customers&lt;/li&gt;
&lt;li&gt;VRFs provide isolation&lt;/li&gt;
&lt;li&gt;Overlapping addresses supported (different RDs)&lt;/li&gt;
&lt;li&gt;Route Targets control connectivity&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Key concepts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;VRF&lt;/strong&gt;: Separate routing table per customer&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RD&lt;/strong&gt;: Makes routes globally unique&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RT&lt;/strong&gt;: Controls import/export between VRFs&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;VPNv4&lt;/strong&gt;: BGP carrying VPN routes with labels&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;VyOS L3VPN support requires MPLS. Verify feature support in your version before production deployment.&lt;/p&gt;
</content:encoded><category>vyos</category><category>bgp</category><category>mpls</category><category>networking</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>BGP-LU: Labeled Unicast for Scalable MPLS Networks</title><link>https://ashimov.com/posts/vyos-bgp-lu/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-bgp-lu/</guid><description>Configure BGP Labeled Unicast on VyOS. Covers label distribution via BGP, inter-AS MPLS, seamless MPLS concepts, and why BGP-LU replaces LDP in modern designs.</description><pubDate>Tue, 27 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;LDP distributes labels within an autonomous system. But LDP doesn&apos;t cross AS boundaries. For MPLS across multiple ASes, you need something else.&lt;/p&gt;
&lt;p&gt;BGP Labeled Unicast (BGP-LU) distributes MPLS labels via BGP — the same protocol already handling inter-AS routing. Labels follow prefixes across AS boundaries, enabling end-to-end MPLS paths spanning multiple networks.&lt;/p&gt;
&lt;p&gt;BGP-LU replaces LDP in modern scalable designs.&lt;/p&gt;
&lt;h2&gt;Why BGP-LU&lt;/h2&gt;
&lt;h3&gt;LDP Limitations&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;AS 65001              AS 65002
[PE1]──[P1]──[ASBR1]  [ASBR2]──[P2]──[PE2]
     LDP               │         LDP
                       │
                   eBGP session
                   (no labels!)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;LDP sessions don&apos;t cross AS boundaries. MPLS stops at the border.&lt;/p&gt;
&lt;h3&gt;BGP-LU Solution&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;AS 65001              AS 65002
[PE1]──[P1]──[ASBR1]══[ASBR2]──[P2]──[PE2]
                   BGP-LU
              (prefixes + labels)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;BGP carries labels along with prefixes. MPLS continues across ASes.&lt;/p&gt;
&lt;h2&gt;BGP-LU Basics&lt;/h2&gt;
&lt;h3&gt;How It Works&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;1. PE1 allocates label for its loopback (10.255.0.1/32)
2. PE1 advertises via BGP: 10.255.0.1/32, label 1001
3. Intermediate routers receive prefix+label
4. Traffic to 10.255.0.1 gets labeled 1001 at ingress
5. Label-switched to PE1
&lt;/code&gt;&lt;/pre&gt;
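&lt;p&gt;The end state at an ingress router is simply a per-prefix binding of next hop and label. A minimal Python sketch — addresses and label values are the illustrative numbers from above:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# What BGP-LU leaves behind: a label bound to each prefix.
bgp_lu_rib = {
    '10.255.0.1/32': {'next_hop': '10.0.0.1', 'label': 1001},
    '10.255.0.2/32': {'next_hop': '10.0.0.2', 'label': 2001},
}

def ingress_action(prefix):
    e = bgp_lu_rib[prefix]
    return 'push label {} toward {}'.format(e['label'], e['next_hop'])

print(ingress_action('10.255.0.1/32'))
&lt;/code&gt;&lt;/pre&gt;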
&lt;h3&gt;BGP-LU vs LDP&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;LDP&lt;/th&gt;
&lt;th&gt;BGP-LU&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Scope&lt;/td&gt;
&lt;td&gt;Single AS&lt;/td&gt;
&lt;td&gt;Multi-AS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Protocol&lt;/td&gt;
&lt;td&gt;Separate (LDP)&lt;/td&gt;
&lt;td&gt;Existing (BGP)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Label binding&lt;/td&gt;
&lt;td&gt;FEC-based&lt;/td&gt;
&lt;td&gt;Prefix-based&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scalability&lt;/td&gt;
&lt;td&gt;IGP-limited&lt;/td&gt;
&lt;td&gt;BGP-limited&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operational&lt;/td&gt;
&lt;td&gt;Two protocols&lt;/td&gt;
&lt;td&gt;One protocol&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;Configuring BGP-LU&lt;/h2&gt;
&lt;h3&gt;Enable Labeled Unicast Address Family&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Enable labeled unicast for IPv4
set protocols bgp address-family ipv4-labeled-unicast

# Redistribute connected/loopback with labels
set protocols bgp address-family ipv4-labeled-unicast redistribute connected

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Configure BGP-LU Neighbor&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# iBGP neighbor with labeled unicast
set protocols bgp neighbor 10.255.0.2 remote-as 65001
set protocols bgp neighbor 10.255.0.2 address-family ipv4-labeled-unicast

# eBGP neighbor with labeled unicast
set protocols bgp neighbor 192.0.2.1 remote-as 65002
set protocols bgp neighbor 192.0.2.1 address-family ipv4-labeled-unicast

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Network Statement with Labels&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Advertise loopback with label
set protocols bgp address-family ipv4-labeled-unicast network 10.255.0.1/32

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Viewing BGP-LU State&lt;/h2&gt;
&lt;h3&gt;Show Labeled Routes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show BGP-LU routes
show bgp ipv4 labeled-unicast

# Output:
# Network          Next Hop        Label      Path
# 10.255.0.1/32    0.0.0.0         1001       i
# 10.255.0.2/32    10.0.0.2        2001       i
# 10.255.0.3/32    192.0.2.1       3001       65002 i
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Show Label Bindings&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show MPLS table (FRR)
show mpls table

# Show specific prefix label
show bgp ipv4 labeled-unicast 10.255.0.2/32
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Inter-AS MPLS Options&lt;/h2&gt;
&lt;h3&gt;Option A: Back-to-Back VRF&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;No BGP-LU needed:
[PE1]──MPLS──[ASBR1]─────VRF─────[ASBR2]──MPLS──[PE2]
                    IP routing
                    (MPLS restarts)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Simple but doesn&apos;t provide end-to-end MPLS.&lt;/p&gt;
&lt;h3&gt;Option B: eBGP Labeled Unicast&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;BGP-LU across ASBRs:
[PE1]──MPLS──[ASBR1]═══BGP-LU═══[ASBR2]──MPLS──[PE2]
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# ASBR1 config
set protocols bgp neighbor 192.0.2.1 remote-as 65002
set protocols bgp neighbor 192.0.2.1 address-family ipv4-labeled-unicast

# ASBR1 must swap labels at boundary
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Option C: Multihop eBGP + Labels&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;BGP-LU between PEs (via route reflector):
[PE1]════════════BGP-LU═════════════[PE2]
        (reflected through ASBRs)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Most scalable, but complex.&lt;/p&gt;
&lt;h2&gt;Seamless MPLS&lt;/h2&gt;
&lt;p&gt;Seamless MPLS uses BGP-LU to create a unified MPLS network:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;┌─ Access ─┐   ┌─ Aggregation ─┐   ┌─ Core ─┐
AN ── AGN ── ABR ── CR ── ABR ── AGN ── AN

AN = Access Node
AGN = Aggregation Node
ABR = Area Border Router
CR = Core Router

BGP-LU provides end-to-end label path
No LDP needed in core
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Benefits&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Single label stack (not stacked LDP + BGP labels)&lt;/li&gt;
&lt;li&gt;Scales to millions of prefixes&lt;/li&gt;
&lt;li&gt;Consistent forwarding behavior&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;BGP-LU with VPN Services&lt;/h2&gt;
&lt;h3&gt;L3VPN over BGP-LU&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# BGP-LU for transport
set protocols bgp address-family ipv4-labeled-unicast

# VPNv4 for customer routes
set protocols bgp address-family ipv4-vpn

# VPNv4 uses BGP-LU next-hop for label stack

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Label Stack&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Outer label: BGP-LU label (transport)
Inner label: VPN label (service)

[BGP-LU Label | VPN Label | IP Packet]
 (outermost)    (bottom of stack, S=1)
&lt;/code&gt;&lt;/pre&gt;
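&lt;p&gt;The ordering matters: the transport label is outermost on the wire, and the VPN label sits at the bottom of stack with S=1. A minimal Python sketch with hypothetical label values:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# (label, S-bit) pairs, outermost first.
stack = [(2001, 0),   # BGP-LU transport label, S=0: more labels follow
         (5001, 1)]   # VPN service label, S=1: bottom of stack

# Only the last entry may carry S=1.
print([s for _, s in stack])
&lt;/code&gt;&lt;/pre&gt;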
&lt;h2&gt;BGP-LU Best Practices&lt;/h2&gt;
&lt;h3&gt;1. Use Route Reflectors&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# BGP-LU at scale needs route reflectors
# Same as regular BGP

set protocols bgp neighbor 10.255.0.100 remote-as 65001
set protocols bgp neighbor 10.255.0.100 address-family ipv4-labeled-unicast
set protocols bgp neighbor 10.255.0.100 update-source lo
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Filter at Boundaries&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Don&apos;t accept labeled routes from customers
set policy prefix-list INFRA-ONLY rule 10 action permit
set policy prefix-list INFRA-ONLY rule 10 prefix 10.255.0.0/16
set policy prefix-list INFRA-ONLY rule 10 le 32

set policy route-map LU-IN rule 10 match ip address prefix-list INFRA-ONLY
set policy route-map LU-IN rule 10 action permit
set policy route-map LU-IN rule 20 action deny

set protocols bgp neighbor 192.0.2.1 address-family ipv4-labeled-unicast route-map import LU-IN
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Consistent Label Allocation&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# All routers should use consistent label allocation policy
# Usually per-prefix labeling for PE loopbacks
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Troubleshooting BGP-LU&lt;/h2&gt;
&lt;h3&gt;Labels Not Exchanged&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check capability negotiation
show bgp neighbors 10.0.0.2

# Look for:
# IPv4 Labeled Unicast: advertised and received

# If not:
# - Check address-family configuration
# - Check both sides support labeled unicast
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Wrong Label&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show label for specific prefix
show bgp ipv4 labeled-unicast 10.255.0.2/32

# Verify label in MPLS table
show mpls table

# Check label is being used for forwarding
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Path Not Using Labels&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Verify next-hop is reachable via MPLS
show bgp ipv4 labeled-unicast 10.255.0.2/32

# Check next-hop resolution
# BGP-LU requires next-hop to have label
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Migration from LDP to BGP-LU&lt;/h2&gt;
&lt;h3&gt;Parallel Operation&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Run both LDP and BGP-LU during migration
# LDP handles existing paths
# BGP-LU handles new paths

# Gradually shift traffic to BGP-LU
# Remove LDP when stable
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Order of Operations&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;1. Enable BGP-LU on all routers
2. Verify BGP-LU paths working
3. Prefer BGP-LU over LDP (if needed)
4. Disable LDP
5. Clean up LDP configuration
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;BGP-LU replaces LDP in modern scalable designs.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;LDP works within a single AS but:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Doesn&apos;t cross AS boundaries&lt;/li&gt;
&lt;li&gt;Requires separate protocol maintenance&lt;/li&gt;
&lt;li&gt;Scales with IGP (limited)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;BGP-LU advantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Works across AS boundaries&lt;/li&gt;
&lt;li&gt;Uses existing BGP infrastructure&lt;/li&gt;
&lt;li&gt;Scales with BGP (better)&lt;/li&gt;
&lt;li&gt;Enables Seamless MPLS designs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When to use BGP-LU:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Multi-AS MPLS networks&lt;/li&gt;
&lt;li&gt;Large-scale service provider networks&lt;/li&gt;
&lt;li&gt;New MPLS deployments&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;VyOS BGP-LU support is functional for basic scenarios. For production SP networks, verify feature completeness in your VyOS version.&lt;/p&gt;
</content:encoded><category>vyos</category><category>bgp</category><category>mpls</category><category>networking</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>MPLS Introduction: Labels, LDP, and Packet Forwarding</title><link>https://ashimov.com/posts/vyos-mpls/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-mpls/</guid><description>Understand MPLS fundamentals on VyOS. Covers label switching, LDP configuration, penultimate hop popping, MPLS forwarding, and why MPLS is still relevant for service provider networks.</description><pubDate>Fri, 23 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;IP routing makes forwarding decisions at every hop. Each router looks up the destination, checks its routing table, forwards the packet. Repeat at every hop. Works fine, but expensive at scale.&lt;/p&gt;
&lt;p&gt;MPLS (Multi-Protocol Label Switching) adds a label at network edge. Interior routers forward based on label only — a simple table lookup, no IP processing. Labels are swapped at each hop until the edge, where the label is removed and IP routing resumes.&lt;/p&gt;
&lt;p&gt;MPLS is still relevant for service provider networks — enabling VPNs, traffic engineering, and fast forwarding at scale.&lt;/p&gt;
&lt;h2&gt;MPLS Concepts&lt;/h2&gt;
&lt;h3&gt;How MPLS Works&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Without MPLS (IP forwarding):
[IP Header: dst=10.0.0.1] → Router A → [route lookup] → Router B → [route lookup] → Router C

With MPLS:
[IP Header] → Edge Router → adds [Label: 100] → Core Router → swaps [Label: 200] → Edge Router → removes label → [IP Header]
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Key Terms&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Label&lt;/td&gt;
&lt;td&gt;20-bit identifier for forwarding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LSP&lt;/td&gt;
&lt;td&gt;Label Switched Path (tunnel through network)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LDP&lt;/td&gt;
&lt;td&gt;Label Distribution Protocol (assigns labels)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Push&lt;/td&gt;
&lt;td&gt;Add label to packet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pop&lt;/td&gt;
&lt;td&gt;Remove label from packet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Swap&lt;/td&gt;
&lt;td&gt;Replace label with new one&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PHP&lt;/td&gt;
&lt;td&gt;Penultimate Hop Popping&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;MPLS Header&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;┌─────────────────┬────┬───┬─────┐
│ Label (20 bits) │ TC │ S │ TTL │
│     (0-1048575) │ 3b │1b │ 8b  │
└─────────────────┴────┴───┴─────┘

TC: Traffic Class (QoS)
S: Bottom of Stack (1 if last label)
TTL: Time to Live
&lt;/code&gt;&lt;/pre&gt;
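&lt;p&gt;The 32-bit shim can be packed and unpacked to make the field boundaries concrete. A minimal Python sketch (multiplication and divmod stand in for bit shifts); the field values are arbitrary:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import struct

def mpls_shim(label, tc, s, ttl):
    # label occupies the top 20 bits, then TC(3), S(1), TTL(8)
    word = label * 4096 + tc * 512 + s * 256 + ttl
    return struct.pack('!I', word)

def parse_shim(data):
    (word,) = struct.unpack('!I', data)
    label, rest = divmod(word, 4096)
    tc, rest = divmod(rest, 512)
    s, ttl = divmod(rest, 256)
    return label, tc, s, ttl

print(parse_shim(mpls_shim(label=100, tc=0, s=1, ttl=64)))
# (100, 0, 1, 64)
&lt;/code&gt;&lt;/pre&gt;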
&lt;h2&gt;VyOS MPLS Support&lt;/h2&gt;
&lt;p&gt;VyOS supports MPLS through FRRouting:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check MPLS support
cat /proc/sys/net/mpls/platform_labels
# If exists, MPLS kernel support is available

# Load MPLS modules
modprobe mpls_router
modprobe mpls_iptunnel
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Basic MPLS Configuration&lt;/h2&gt;
&lt;h3&gt;Enable MPLS on Interfaces&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Enable MPLS kernel support
set system sysctl parameter net.mpls.platform_labels value 100000

# Enable MPLS input on interfaces (via sysctl)
set system sysctl parameter net.mpls.conf.eth0.input value 1
set system sysctl parameter net.mpls.conf.eth1.input value 1

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Configure LDP&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Enable LDP router ID
set protocols mpls ldp router-id 10.255.0.1

# Configure LDP interfaces
set protocols mpls ldp interface eth0
set protocols mpls ldp interface eth1

# Optional: Discovery hello interval
set protocols mpls ldp discovery hello-interval 5
set protocols mpls ldp discovery hello-holdtime 15

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;LDP with Targeted Neighbors&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# For non-adjacent neighbors (over tunnels)
set protocols mpls ldp targeted-neighbor ipv4 address 10.255.0.2

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;LDP Operation&lt;/h2&gt;
&lt;h3&gt;LDP Session Establishment&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;1. Router discovers neighbors via Hello packets (UDP 646)
2. TCP session established to neighbor (port 646)
3. Label mappings exchanged
4. LSPs formed
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Viewing LDP Status&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show LDP neighbors
show mpls ldp neighbor

# Output:
# Peer LDP Identifier: 10.255.0.2:0
#   TCP connection: 10.255.0.1:646 - 10.255.0.2:54321
#   State: Operational
#   Messages sent/received: 1234/5678

# Show LDP bindings (label mappings)
show mpls ldp binding

# Show MPLS forwarding table
show mpls table
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;MPLS Forwarding&lt;/h2&gt;
&lt;h3&gt;Label Operations&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# View MPLS forwarding table
show mpls table

# Output:
# Inbound Label  Type    Nexthop           Outbound Label
# 100            LDP     10.0.0.2          200
# 101            LDP     10.0.0.2          201
# 102            LDP     10.0.0.2          implicit-null

# implicit-null = PHP (penultimate hop popping)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Penultimate Hop Popping (PHP)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Without PHP:
[Label:100] → Router A → [Label:200] → Router B → [Label:300] → Router C → [no label] → Destination
                                                                    ↑ Two operations: pop label + IP lookup

With PHP:
[Label:100] → Router A → [Label:200] → Router B → [no label] → Router C → Destination
                                          ↑ Pop here (second-to-last hop)
                                                              ↑ Only IP lookup needed
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Router C signals &quot;implicit-null&quot; label to Router B, telling it to pop the label.&lt;/p&gt;
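&lt;p&gt;The PHP decision is a one-line special case in the label table walk. A minimal Python sketch — the LFIB entry is hypothetical:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# LFIB entry on Router B for the LSP toward Router C.
lfib = {100: ('RouterC', 'implicit-null')}

def forward(in_label):
    next_hop, out_label = lfib[in_label]
    if out_label == 'implicit-null':
        return next_hop, None   # PHP: pop, send unlabeled
    return next_hop, out_label  # normal swap

print(forward(100))
&lt;/code&gt;&lt;/pre&gt;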
&lt;h2&gt;MPLS with IGP&lt;/h2&gt;
&lt;h3&gt;MPLS + OSPF&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Configure OSPF
set protocols ospf area 0 network 10.0.0.0/24
set protocols ospf passive-interface default
set protocols ospf interface eth0 passive disable
set protocols ospf interface eth1 passive disable

# LDP follows OSPF paths
set protocols mpls ldp interface eth0
set protocols mpls ldp interface eth1

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;MPLS + IS-IS&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Configure IS-IS
set protocols isis interface eth0
set protocols isis interface eth1
set protocols isis net 49.0001.0100.0100.0001.00

# LDP follows IS-IS paths
set protocols mpls ldp interface eth0
set protocols mpls ldp interface eth1

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Simple MPLS Network Example&lt;/h2&gt;
&lt;h3&gt;Topology&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;[CE1] ── [PE1] ═══ [P1] ═══ [PE2] ── [CE2]
         10.255.0.1   10.255.0.2   10.255.0.3

PE = Provider Edge (MPLS edge)
P = Provider (MPLS core)
CE = Customer Edge (no MPLS)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;PE1 Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Loopback for router-id
set interfaces loopback lo address 10.255.0.1/32

# WAN interface toward P1
set interfaces ethernet eth0 address 10.0.0.1/30

# Customer interface (no MPLS)
set interfaces ethernet eth1 address 192.168.1.1/24

# OSPF
set protocols ospf parameters router-id 10.255.0.1
set protocols ospf area 0 network 10.255.0.1/32
set protocols ospf area 0 network 10.0.0.0/30

# LDP
set protocols mpls ldp router-id 10.255.0.1
set protocols mpls ldp discovery transport-ipv4-address 10.255.0.1
set protocols mpls ldp interface eth0

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;P1 Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Loopback
set interfaces loopback lo address 10.255.0.2/32

# Interfaces
set interfaces ethernet eth0 address 10.0.0.2/30
set interfaces ethernet eth1 address 10.0.0.5/30

# OSPF
set protocols ospf router-id 10.255.0.2
set protocols ospf area 0 network 10.255.0.2/32
set protocols ospf area 0 network 10.0.0.0/30
set protocols ospf area 0 network 10.0.0.4/30

# LDP
set protocols mpls ldp router-id 10.255.0.2
set protocols mpls ldp discovery transport-ipv4-address 10.255.0.2
set protocols mpls ldp interface eth0
set protocols mpls ldp interface eth1

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Troubleshooting MPLS&lt;/h2&gt;
&lt;h3&gt;LDP Neighbor Not Forming&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check interface MPLS is enabled
show interfaces ethernet eth0

# Check LDP is listening
ss -tulnp | grep 646

# Check for LDP hellos
sudo tcpdump -i eth0 udp port 646

# Check OSPF/IGP adjacency (LDP follows IGP)
show ip ospf neighbor
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Labels Not Assigned&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check LDP bindings
show mpls ldp binding

# Check MPLS forwarding table
show mpls table

# Verify MPLS modules loaded
lsmod | grep mpls
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Packets Not Label-Switched&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Verify ingress interface has MPLS enabled
cat /proc/sys/net/mpls/conf/eth0/input

# Should be 1; if 0, enable it:
echo 1 &amp;gt; /proc/sys/net/mpls/conf/eth0/input

# Check kernel MPLS support
cat /proc/sys/net/mpls/platform_labels
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;MPLS MTU Considerations&lt;/h2&gt;
&lt;p&gt;Each MPLS label adds 4 bytes of header overhead:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Standard Ethernet MTU: 1500
# With one MPLS label: 1500 - 4 = 1496 effective payload
# With two labels (VPN): 1500 - 8 = 1492 effective payload

# Option 1: Increase MTU on MPLS interfaces
set interfaces ethernet eth0 mtu 1508

# Option 2: Fragment at ingress (less efficient)
&lt;/code&gt;&lt;/pre&gt;
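&lt;p&gt;Why exactly 4 bytes per label? Each label-stack entry is one fixed 32-bit word (RFC 3032): 20 bits of label, 3 bits of traffic class, a bottom-of-stack flag, and an 8-bit TTL. A small Python sketch of the packing (illustrative only, not how VyOS builds packets):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import struct

def mpls_entry(label, tc=0, bottom=True, ttl=64):
    # One 4-byte label-stack entry (RFC 3032):
    # label (20 bits), traffic class (3), bottom-of-stack (1), TTL (8)
    word = label * 4096 + tc * 512 + int(bottom) * 256 + ttl
    return struct.pack('!I', word)

# Two-label stack, e.g. an L3VPN packet: transport label + VPN label
stack = mpls_entry(100, bottom=False) + mpls_entry(200)
print(len(stack))  # 8 bytes of overhead, hence 1500 - 8 = 1492 payload
&lt;/code&gt;&lt;/pre&gt;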
&lt;h2&gt;MPLS Security&lt;/h2&gt;
&lt;h3&gt;Control Plane Security&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Restrict LDP sessions
# Only accept from known neighbors
set protocols mpls ldp neighbor 10.255.0.2 password &quot;secret&quot;

# Filter LDP discovery and session traffic
# (Use firewall to limit UDP 646 hellos and TCP 646 sessions to known peers)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Data Plane Considerations&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# MPLS doesn&apos;t encrypt traffic
# Anyone on the path can read label and content

# For encryption, use:
# - IPsec over MPLS
# - MACsec at Layer 2
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;MPLS is still relevant for service provider networks.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;MPLS provides:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Fast forwarding (label lookup vs. IP lookup)&lt;/li&gt;
&lt;li&gt;VPN services (L2VPN, L3VPN)&lt;/li&gt;
&lt;li&gt;Traffic engineering (explicit paths)&lt;/li&gt;
&lt;li&gt;QoS capabilities&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;VyOS MPLS support is functional but limited:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Basic LDP works&lt;/li&gt;
&lt;li&gt;Advanced features (RSVP-TE, Segment Routing) may be limited&lt;/li&gt;
&lt;li&gt;Check VyOS version for specific feature support&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For modern networks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Small networks: IP routing is fine&lt;/li&gt;
&lt;li&gt;Large SP networks: MPLS still valuable&lt;/li&gt;
&lt;li&gt;Newer alternative: Segment Routing (SR-MPLS, SRv6)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Understand MPLS fundamentals even if you don&apos;t use it daily — many service provider networks and VPN services depend on it.&lt;/p&gt;
</content:encoded><category>vyos</category><category>mpls</category><category>networking</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>BGP Dampening: Suppressing Route Flapping</title><link>https://ashimov.com/posts/vyos-bgp-dampening/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-bgp-dampening/</guid><description>Configure BGP route dampening on VyOS. Covers dampening parameters, penalty calculation, route suppression, reuse thresholds, and why dampening prevents unstable routes from destabilizing your network.</description><pubDate>Tue, 20 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A remote router has a flaky connection. Route appears, disappears, appears again — 10 times per minute. Each flap propagates through BGP. Your router processes updates. Your peers process updates. The entire internet processes updates. All for a route that will flap again in seconds.&lt;/p&gt;
&lt;p&gt;BGP dampening penalizes routes that flap frequently. After enough flaps, the route is suppressed — temporarily hidden until it proves stable. This protects your network from chasing unstable routes.&lt;/p&gt;
&lt;p&gt;Dampening prevents unstable routes from destabilizing your network.&lt;/p&gt;
&lt;h2&gt;How Dampening Works&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;Route flap detected:
  → Penalty added (e.g., +1000)
  → If penalty &amp;gt; suppress threshold: route suppressed
  → Penalty decays over time
  → If penalty &amp;lt; reuse threshold: route unsuppressed
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Key Parameters&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Typical Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Half-life&lt;/td&gt;
&lt;td&gt;Time for penalty to decay by half&lt;/td&gt;
&lt;td&gt;15 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reuse&lt;/td&gt;
&lt;td&gt;Penalty below which route is reused&lt;/td&gt;
&lt;td&gt;750&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Suppress&lt;/td&gt;
&lt;td&gt;Penalty above which route is suppressed&lt;/td&gt;
&lt;td&gt;2000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max-suppress&lt;/td&gt;
&lt;td&gt;Maximum suppression time&lt;/td&gt;
&lt;td&gt;60 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Example Timeline&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Time 0:00 - Route withdrawn  → Penalty: 1000
Time 0:01 - Route announced  → Penalty: 2000 (flap)
Time 0:02 - Route withdrawn  → Penalty: 3000 → SUPPRESSED (&amp;gt; 2000)
Time 0:03 - Route announced  → Still suppressed, penalty +1000 = 4000

Time 15:00 - Penalty decayed to 2000 (half-life)
Time 30:00 - Penalty decayed to 1000
Time 35:00 - Penalty decayed to ~750 → REUSED (&amp;lt; 750)
&lt;/code&gt;&lt;/pre&gt;
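&lt;p&gt;The decay is plain exponential half-life, so the reuse delay can be computed directly. A quick Python sketch (simplified model using the parameters from the table above):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import math

HALF_LIFE_MIN = 15.0   # penalty halves every 15 minutes
REUSE = 750            # route reused below this penalty
FLAP_PENALTY = 1000    # penalty added per flap

# Three quick flaps accumulate 3000 penalty (suppressed, threshold 2000)
penalty = 3 * FLAP_PENALTY

# Decay: penalty * 0.5 ** (t / half_life); solve for the reuse time
minutes = HALF_LIFE_MIN * math.log2(penalty / REUSE)
print(minutes)  # 30.0 minutes until a 3000 penalty decays to 750
&lt;/code&gt;&lt;/pre&gt;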
&lt;h2&gt;Basic Dampening Configuration&lt;/h2&gt;
&lt;h3&gt;Enable Dampening Globally&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Enable with default parameters
set protocols bgp address-family ipv4-unicast dampening

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Custom Parameters&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Custom dampening parameters
set protocols bgp address-family ipv4-unicast dampening half-life 15
set protocols bgp address-family ipv4-unicast dampening re-use 750
set protocols bgp address-family ipv4-unicast dampening start-suppress-time 2000
set protocols bgp address-family ipv4-unicast dampening max-suppress-time 60

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Per-Neighbor Dampening&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Different dampening for different peers
# Typically done via route-map

configure

# More aggressive for untrusted peers
set policy route-map DAMPENING-AGGRESSIVE rule 10 action permit
set policy route-map DAMPENING-AGGRESSIVE rule 10 set dampening half-life 10
set policy route-map DAMPENING-AGGRESSIVE rule 10 set dampening re-use 500
set policy route-map DAMPENING-AGGRESSIVE rule 10 set dampening start-suppress-time 1500

# Apply to peer
set protocols bgp neighbor 10.0.0.2 address-family ipv4-unicast route-map import DAMPENING-AGGRESSIVE

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Viewing Dampening Status&lt;/h2&gt;
&lt;h3&gt;Show Dampened Routes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show all dampened routes
show bgp ipv4 unicast dampening dampened-paths

# Output:
# Network          From             Reuse    Path
# 203.0.113.0/24   10.0.0.2        00:25:00 65002 65003
# 198.51.100.0/24  10.0.0.2        00:45:00 65002 65004
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Show Flap Statistics&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show routes with flap history
show bgp ipv4 unicast dampening flap-statistics

# Output:
# Network          From             Flaps Duration Reuse    Path
# 203.0.113.0/24   10.0.0.2        15    01:30:00 00:25:00 65002
# 198.51.100.0/24  10.0.0.2        8     00:45:00 00:45:00 65002
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Show Dampening Parameters&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show configured parameters
show bgp ipv4 unicast dampening parameters

# Output:
# Half-life time: 15 minutes
# Reuse penalty: 750
# Suppress penalty: 2000
# Max suppress time: 60 minutes
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Clearing Dampening&lt;/h2&gt;
&lt;h3&gt;Clear Specific Route&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Clear dampening for specific prefix
clear bgp ipv4 unicast dampening 203.0.113.0/24
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Clear All Dampened Routes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Clear all dampening history
clear bgp ipv4 unicast dampening
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Parameter Tuning&lt;/h2&gt;
&lt;h3&gt;Aggressive Dampening&lt;/h3&gt;
&lt;p&gt;For untrusted peers or known-unstable sources:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Quick to suppress, slow to recover
set protocols bgp address-family ipv4-unicast dampening half-life 10
set protocols bgp address-family ipv4-unicast dampening re-use 500
set protocols bgp address-family ipv4-unicast dampening start-suppress-time 1000
set protocols bgp address-family ipv4-unicast dampening max-suppress-time 60

# Suppress after ~1-2 flaps
# Stay suppressed for up to 1 hour
# Be very stable to return

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Lenient Dampening&lt;/h3&gt;
&lt;p&gt;For trusted peers or critical routes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Harder to suppress, quick to recover
set protocols bgp address-family ipv4-unicast dampening half-life 20
set protocols bgp address-family ipv4-unicast dampening re-use 1000
set protocols bgp address-family ipv4-unicast dampening start-suppress-time 3000
set protocols bgp address-family ipv4-unicast dampening max-suppress-time 30

# Suppress after ~3+ flaps
# Return to use faster
# Maximum 30 minutes suppression

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Calculating Flaps to Suppress&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Suppress threshold / Penalty per flap = Flaps to suppress

Example:
2000 / 1000 = 2 flaps → suppressed

With decay during flapping, actual number varies
&lt;/code&gt;&lt;/pre&gt;
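&lt;p&gt;The same arithmetic in Python (ceiling division, decay between flaps ignored):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def flaps_to_suppress(suppress_threshold, per_flap=1000):
    # Smallest number of flaps whose accumulated penalty
    # reaches the suppress threshold (ceiling division)
    return -(-suppress_threshold // per_flap)

print(flaps_to_suppress(2000))  # 2  (default parameters)
print(flaps_to_suppress(3000))  # 3  (lenient profile)
print(flaps_to_suppress(2500))  # 3  (partial flaps round up)
&lt;/code&gt;&lt;/pre&gt;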
&lt;h2&gt;Selective Dampening&lt;/h2&gt;
&lt;h3&gt;Dampen Only Specific Prefixes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Only dampen longer prefixes (more specific routes)
set policy prefix-list DAMPEN-TARGETS rule 10 action permit
set policy prefix-list DAMPEN-TARGETS rule 10 prefix 0.0.0.0/0
set policy prefix-list DAMPEN-TARGETS rule 10 ge 24

set policy route-map SELECTIVE-DAMPEN rule 10 match ip address prefix-list DAMPEN-TARGETS
set policy route-map SELECTIVE-DAMPEN rule 10 action permit
set policy route-map SELECTIVE-DAMPEN rule 10 set dampening half-life 15

set policy route-map SELECTIVE-DAMPEN rule 20 action permit
# No dampening for other routes

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Exclude Critical Routes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Don&apos;t dampen default route or critical prefixes
set policy prefix-list NO-DAMPEN rule 10 action permit
set policy prefix-list NO-DAMPEN rule 10 prefix 0.0.0.0/0
# Critical DNS prefix
set policy prefix-list NO-DAMPEN rule 20 action permit
set policy prefix-list NO-DAMPEN rule 20 prefix 8.8.8.0/24

set policy route-map SAFE-DAMPEN rule 10 match ip address prefix-list NO-DAMPEN
set policy route-map SAFE-DAMPEN rule 10 action permit
# No dampening for matched routes

set policy route-map SAFE-DAMPEN rule 20 action permit
set policy route-map SAFE-DAMPEN rule 20 set dampening half-life 15

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Dampening Considerations&lt;/h2&gt;
&lt;h3&gt;When to Use Dampening&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;✓ Edge routers receiving external routes&lt;/li&gt;
&lt;li&gt;✓ Networks with known flapping sources&lt;/li&gt;
&lt;li&gt;✓ Protection against propagating instability&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;When Not to Use Dampening&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;✗ Internal BGP (iBGP) — hides real problems&lt;/li&gt;
&lt;li&gt;✗ Critical routes where availability is paramount&lt;/li&gt;
&lt;li&gt;✗ Networks with known slow convergence (dampening adds to it)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Dampening vs BFD&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;BFD: Detect failures FAST
Dampening: Suppress UNSTABLE routes

They solve different problems:
- BFD: &quot;Quickly know when peer is dead&quot;
- Dampening: &quot;Don&apos;t trust peers that keep dying&quot;

Use both together for:
- Fast failure detection (BFD)
- Protection from flapping (dampening)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Monitoring and Alerting&lt;/h2&gt;
&lt;h3&gt;Monitor Dampening Events&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Log when routes are suppressed/reused
show log | grep -i dampen

# Track frequently dampened prefixes
show bgp ipv4 unicast dampening flap-statistics
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Alert on Persistent Dampening&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# If route stays dampened, investigate source
# Persistent dampening = persistent instability somewhere

# Check which peer is source
show bgp ipv4 unicast 203.0.113.0/24
# Note the &quot;from&quot; peer
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Troubleshooting&lt;/h2&gt;
&lt;h3&gt;Route Stays Suppressed&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check current penalty and reuse time
show bgp ipv4 unicast dampening dampened-paths

# If penalty not decaying:
# - Recent flaps reset penalty
# - Half-life too long

# Manual clear if needed
clear bgp ipv4 unicast dampening 203.0.113.0/24
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Expected Route Not Appearing&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Might be dampened
show bgp ipv4 unicast dampening dampened-paths | grep &amp;lt;prefix&amp;gt;

# If dampened, wait for reuse or clear
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Dampening Not Working&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Verify dampening is enabled
show bgp ipv4 unicast dampening parameters

# Check route-map is applied
show configuration commands | grep dampening
show configuration commands | grep route-map
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Best Practices&lt;/h2&gt;
&lt;h3&gt;1. Start Conservative&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Default parameters are reasonable starting point
set protocols bgp address-family ipv4-unicast dampening

# Monitor before tuning
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Different Parameters for Different Sources&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Tier 1 transit: Lenient (trusted)
# Tier 2 transit: Standard
# Peers: Standard
# Customers: Standard or aggressive (their flaps propagate through you)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Monitor Dampening Statistics&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Regular review of what&apos;s being dampened
# Persistent dampening = investigate root cause
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4. Don&apos;t Dampen Everything&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Critical routes shouldn&apos;t be dampened
# Internal routes shouldn&apos;t be dampened
# Only dampen external routes from untrusted/unknown sources
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Dampening prevents unstable routes from destabilizing your network.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Without dampening:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Flapping route → constant updates&lt;/li&gt;
&lt;li&gt;Updates propagate to all peers&lt;/li&gt;
&lt;li&gt;CPU and memory consumed processing junk&lt;/li&gt;
&lt;li&gt;Potentially affects routing for stable routes&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With dampening:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Flapping route → penalty accumulates&lt;/li&gt;
&lt;li&gt;After threshold → route suppressed&lt;/li&gt;
&lt;li&gt;Network ignores unstable route&lt;/li&gt;
&lt;li&gt;Stable routes unaffected&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The tradeoff: Dampening delays convergence for routes that are legitimately changing. A real path change looks like a flap. Too aggressive dampening can hide valid routes.&lt;/p&gt;
&lt;p&gt;Balance: Conservative dampening on external routes, no dampening on internal/critical routes.&lt;/p&gt;
</content:encoded><category>vyos</category><category>bgp</category><category>networking</category><category>routing</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>ECMP and Multipath: Load Balancing at the Routing Layer</title><link>https://ashimov.com/posts/vyos-ecmp/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-ecmp/</guid><description>Configure ECMP on VyOS for route-level load balancing. Covers equal-cost paths, multipath BGP, hash algorithms, troubleshooting uneven distribution, and why ECMP is simple but requires understanding.</description><pubDate>Fri, 16 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Two paths to the same destination. Same cost. Traditional routing picks one. ECMP (Equal-Cost Multi-Path) uses both, spreading traffic across available paths.&lt;/p&gt;
&lt;p&gt;The concept is simple: multiple equal routes, traffic distributed. The implementation has nuances: how traffic is distributed, what makes routes &quot;equal,&quot; and why some flows always use the same path.&lt;/p&gt;
&lt;p&gt;ECMP is simple but requires understanding to use effectively.&lt;/p&gt;
&lt;h2&gt;How ECMP Works&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;Without ECMP:
                    [Path A: cost 10] → (used)
Host → Router →
                    [Path B: cost 10] → (ignored)

With ECMP:
                    [Path A: cost 10] → (50% traffic)
Host → Router →
                    [Path B: cost 10] → (50% traffic)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Traffic is distributed per-flow, not per-packet. All packets for the same flow use the same path (preventing reordering).&lt;/p&gt;
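&lt;p&gt;A toy Python model of per-flow selection; any stable hash of the 5-tuple gives this property (the kernel uses its own hash function, not SHA-256):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import hashlib

PATHS = ['via 192.168.1.1', 'via 192.168.1.2']

def pick_path(src_ip, dst_ip, sport, dport, proto='tcp'):
    # Hash the 5-tuple; every packet of a flow maps to the same path
    key = '%s %s %s %s %s' % (src_ip, dst_ip, sport, dport, proto)
    digest = hashlib.sha256(key.encode()).digest()
    return PATHS[digest[0] % len(PATHS)]

# Same flow, same path every time (no reordering):
first = pick_path('10.1.1.1', '10.2.2.2', 40000, 443)
assert first == pick_path('10.1.1.1', '10.2.2.2', 40000, 443)
# A different source port may land on the other path
&lt;/code&gt;&lt;/pre&gt;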
&lt;h2&gt;Basic ECMP Configuration&lt;/h2&gt;
&lt;h3&gt;Static Routes ECMP&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Two equal-cost static routes
set protocols static route 10.0.0.0/8 next-hop 192.168.1.1
set protocols static route 10.0.0.0/8 next-hop 192.168.1.2

# VyOS automatically installs both if costs equal
commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Verify ECMP Routes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;show ip route 10.0.0.0/8

# Output:
# S&amp;gt;* 10.0.0.0/8 [1/0] via 192.168.1.1, eth0, weight 1, 00:05:00
#                 via 192.168.1.2, eth1, weight 1, 00:05:00
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Multiple next-hops shown = ECMP active.&lt;/p&gt;
&lt;h2&gt;ECMP with BGP&lt;/h2&gt;
&lt;h3&gt;Enable Multipath&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Enable ECMP for eBGP
set protocols bgp address-family ipv4-unicast maximum-paths ebgp 4

# Enable ECMP for iBGP
set protocols bgp address-family ipv4-unicast maximum-paths ibgp 4

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;BGP Path Requirements&lt;/h3&gt;
&lt;p&gt;For BGP paths to be ECMP-eligible, they must have:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Same AS_PATH length&lt;/li&gt;
&lt;li&gt;Same origin (IGP/EGP/incomplete)&lt;/li&gt;
&lt;li&gt;Same MED (or MED comparison disabled)&lt;/li&gt;
&lt;li&gt;Same local preference&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# Compare paths
show bgp ipv4 unicast 10.0.0.0/8

# If paths differ in AS_PATH length, not ECMP-eligible
# Path 1: AS_PATH 65001 65002 (length 2)
# Path 2: AS_PATH 65001 (length 1)  ← shorter, wins alone
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Allow Multipath from Same AS&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# For multiple connections to same AS
set protocols bgp address-family ipv4-unicast maximum-paths ebgp 4
set protocols bgp parameters bestpath as-path multipath-relax

# multipath-relax: allows ECMP even if AS_PATH contents differ (same length)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;ECMP with OSPF&lt;/h2&gt;
&lt;h3&gt;Enable OSPF ECMP&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# OSPF supports ECMP by default
# Configure maximum paths
set protocols ospf parameters maximum-paths 4

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;OSPF naturally creates ECMP when multiple paths have equal cost.&lt;/p&gt;
&lt;h2&gt;Hash Algorithm&lt;/h2&gt;
&lt;p&gt;Traffic distribution uses hash of packet headers. Same hash = same path.&lt;/p&gt;
&lt;h3&gt;Hash Inputs&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Default hash inputs:
- Source IP
- Destination IP
- Source port
- Destination port
- Protocol

Hash result → selects path
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Configure Hash Algorithm&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# VyOS uses kernel&apos;s fib_multipath_hash_policy
# 0 = Layer 3 only (src/dst IP)
# 1 = Layer 4 (src/dst IP + ports)
# 2 = Layer 3 of inner packet for tunneled/encapsulated traffic

configure
set system sysctl parameter net.ipv4.fib_multipath_hash_policy value 1
commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Layer 3 vs Layer 4 Hash&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Layer 3 only:
# Same src/dst IP pair always uses same path
# Different src IPs spread across paths

# Layer 4:
# Same src/dst IP but different ports can use different paths
# Better distribution for single-host scenarios
&lt;/code&gt;&lt;/pre&gt;
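&lt;p&gt;The layer 3 vs layer 4 difference is easy to demonstrate with a toy two-path hash (the kernel&apos;s actual hash differs, but the behavior is the same):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import hashlib

def bucket(*fields):
    # Toy 2-path hash over whichever header fields are included
    key = ' '.join(str(f) for f in fields).encode()
    return hashlib.sha256(key).digest()[0] % 2

# Layer 3: one host pair always hashes to one path, ports ignored
l3 = {bucket('10.0.0.1', '10.0.0.2') for _ in range(100)}
print(len(l3))  # 1

# Layer 4: varying source ports spreads the same host pair
l4 = {bucket('10.0.0.1', '10.0.0.2', sport, 443) for sport in range(1024, 1124)}
print(len(l4))  # almost certainly 2 across 100 distinct ports
&lt;/code&gt;&lt;/pre&gt;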
&lt;h2&gt;Troubleshooting ECMP&lt;/h2&gt;
&lt;h3&gt;Issue: Uneven Distribution&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# One path getting most traffic

# Causes:
# 1. Hash algorithm + traffic pattern = uneven
# 2. Not actually ECMP (one path preferred)
# 3. Few unique flows (small sample)

# Check if actually ECMP
show ip route 10.0.0.0/8
# Must show multiple next-hops

# Monitor per-path traffic
# Use interface counters
watch -n 1 &apos;show interfaces ethernet eth0; show interfaces ethernet eth1&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Issue: Single Flow Always Same Path&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# This is expected behavior!
# ECMP hashes per-flow, not per-packet

# Same src/dst/port always hashes to same path
# Prevents packet reordering

# For testing, use different source ports
nc -p 10001 server 80
nc -p 10002 server 80
# May use different paths
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Issue: Paths Not Equal&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# BGP paths not becoming ECMP

show bgp ipv4 unicast 10.0.0.0/8 bestpath

# Check what makes them unequal:
# - AS_PATH length different?
# - MED different?
# - Local preference different?

# Fix the inequality or enable multipath-relax
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Issue: Route Flapping&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# One path keeps appearing/disappearing

# ECMP recalculates when paths change
# Can cause flow redistribution

# Solution: Stabilize the flapping path
# Or implement dampening
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Weighted ECMP&lt;/h2&gt;
&lt;p&gt;Not all paths are equal? Use weights:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Equal distance on both next-hops = plain 50/50 ECMP, not weighting
set protocols static route 10.0.0.0/8 next-hop 192.168.1.1 distance 1
set protocols static route 10.0.0.0/8 next-hop 192.168.1.2 distance 1

# Different distances give active/backup, not ECMP at all
# VyOS static routes don&apos;t expose a per-next-hop weight

# For weighted distribution, consider:
# - BGP with link-bandwidth (where supported)
# - Policy routing with firewall marks
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;ECMP Failure Handling&lt;/h2&gt;
&lt;h3&gt;When One Path Fails&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# ECMP automatically removes failed path
# Traffic redistributes to remaining paths

# Flow rehashing happens:
# - Some flows move to different paths
# - Brief reordering possible during transition
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;BFD for Fast Failure Detection&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Use BFD to quickly detect path failure
set protocols bfd peer 192.168.1.1
set protocols bfd peer 192.168.1.2

# When BFD detects failure, route withdrawn immediately
# ECMP recalculates faster than waiting for routing protocol
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;ECMP Best Practices&lt;/h2&gt;
&lt;h3&gt;1. Match Bandwidth&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# ECMP assumes equal paths
# 10G + 1G ECMP = poor utilization

# Either:
# - Use paths with equal bandwidth
# - Use weighted/unequal ECMP if available
# - Different approach (LAG, policy routing)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Enable Layer 4 Hash&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Better distribution for typical traffic
set system sysctl parameter net.ipv4.fib_multipath_hash_policy value 1
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Monitor Both Paths&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Dashboard showing:
# - Traffic per path
# - Errors per path
# - ECMP route status
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4. Test Failover&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Regularly test:
# 1. Disable one path
# 2. Verify traffic flows via remaining path
# 3. Re-enable path
# 4. Verify ECMP resumes
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;ECMP vs LAG&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;ECMP&lt;/th&gt;
&lt;th&gt;LAG (Bond)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Layer&lt;/td&gt;
&lt;td&gt;3 (routing)&lt;/td&gt;
&lt;td&gt;2 (switching)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scope&lt;/td&gt;
&lt;td&gt;Different paths via different routers&lt;/td&gt;
&lt;td&gt;Multiple links to one device&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failure detection&lt;/td&gt;
&lt;td&gt;Routing protocol&lt;/td&gt;
&lt;td&gt;LACP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Configuration&lt;/td&gt;
&lt;td&gt;Routing config&lt;/td&gt;
&lt;td&gt;Interface config&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scalability&lt;/td&gt;
&lt;td&gt;Many paths&lt;/td&gt;
&lt;td&gt;Limited ports&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;pre&gt;&lt;code&gt;# ECMP: Different next-hop routers
# LAG: Same router, bundled interfaces

# Use LAG for link aggregation to single device
# Use ECMP for path diversity across network
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Checking ECMP Status&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;# View kernel routing table
ip route show 10.0.0.0/8

# Show with ECMP details
ip route show 10.0.0.0/8 | grep -i nexthop

# Count ECMP paths
ip route show 10.0.0.0/8 | grep -c nexthop

# Test which path a flow would take
ip route get 10.0.0.100 from 192.168.1.50
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;ECMP is simple but requires understanding to use effectively.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;What ECMP gives you:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Automatic load distribution across equal paths&lt;/li&gt;
&lt;li&gt;Redundancy (path failure → automatic reroute)&lt;/li&gt;
&lt;li&gt;Increased aggregate bandwidth&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What ECMP doesn&apos;t give you:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Per-packet load balancing (would cause reordering)&lt;/li&gt;
&lt;li&gt;Intelligent traffic distribution (hash-based, may be uneven)&lt;/li&gt;
&lt;li&gt;Weighted distribution (standard ECMP is equal-cost)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Key understanding:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Paths must be truly equal (cost, metrics)&lt;/li&gt;
&lt;li&gt;Distribution is per-flow, not per-packet&lt;/li&gt;
&lt;li&gt;Hash algorithm determines distribution&lt;/li&gt;
&lt;li&gt;Layer 4 hash usually gives better distribution&lt;/li&gt;
&lt;li&gt;Uneven traffic is normal with few flows&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Configure it, understand it, monitor it. ECMP works well when you know what to expect.&lt;/p&gt;
</content:encoded><category>vyos</category><category>bgp</category><category>ospf</category><category>routing</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Route Leaking Between VRFs: Controlled Connectivity</title><link>https://ashimov.com/posts/vyos-vrf-leaking/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-vrf-leaking/</guid><description>Configure route leaking between VRFs on VyOS. Covers import/export policies, selective leaking, shared services, and why route leaking provides controlled cross-VRF connectivity.</description><pubDate>Tue, 13 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;VRFs isolate routing tables. Customer A can&apos;t reach Customer B. But what about shared services? The DNS server, the management network, the internet gateway — they need to be reachable from all VRFs.&lt;/p&gt;
&lt;p&gt;Route leaking imports routes from one VRF into another. Not all routes — just the ones you explicitly allow. DNS server in VRF SHARED becomes reachable from VRF CUSTOMER-A without full interconnection.&lt;/p&gt;
&lt;p&gt;Route leaking provides controlled cross-VRF connectivity.&lt;/p&gt;
&lt;h2&gt;Why Route Leaking&lt;/h2&gt;
&lt;h3&gt;Without Route Leaking&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;VRF CUSTOMER-A:  10.1.0.0/16
VRF CUSTOMER-B:  10.2.0.0/16
VRF SHARED:      10.100.0.0/16 (DNS, NTP, etc.)

Problem: Customers can&apos;t reach shared services
Solution 1: NAT (complex, breaks some apps)
Solution 2: Route leaking (cleaner)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;With Route Leaking&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;VRF CUSTOMER-A:
  - 10.1.0.0/16 (local)
  - 10.100.0.0/16 (leaked from SHARED)

VRF SHARED:
  - 10.100.0.0/16 (local)
  - 10.1.0.0/16 (leaked from CUSTOMER-A, if needed)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Basic Route Leaking&lt;/h2&gt;
&lt;h3&gt;VRF Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Create VRFs
set vrf name CUSTOMER-A table 10
set vrf name SHARED table 20

# Assign interfaces
set interfaces ethernet eth1 vrf CUSTOMER-A
set interfaces ethernet eth2 vrf SHARED

# Configure addresses
set interfaces ethernet eth1 address 10.1.0.1/24
set interfaces ethernet eth2 address 10.100.0.1/24

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Leak Routes from SHARED to CUSTOMER-A&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Define what to leak (prefix list)
set policy prefix-list SHARED-SERVICES rule 10 prefix 10.100.0.0/16
set policy prefix-list SHARED-SERVICES rule 10 action permit

# Route map for import
set policy route-map IMPORT-SHARED rule 10 match ip address prefix-list SHARED-SERVICES
set policy route-map IMPORT-SHARED rule 10 action permit

# Import into CUSTOMER-A from SHARED
# (the prefix-list/route-map above apply when leaking via BGP;
#  a simple static inter-VRF route is shown here)
set vrf name CUSTOMER-A protocols static route 10.100.0.0/16 interface eth2 vrf SHARED

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Using BGP for Route Leaking&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# BGP in each VRF
set vrf name CUSTOMER-A protocols bgp system-as 65000
set vrf name SHARED protocols bgp system-as 65000

# Route distinguisher (unique per VRF)
set vrf name CUSTOMER-A protocols bgp address-family ipv4-unicast rd vpn export 65000:10
set vrf name SHARED protocols bgp address-family ipv4-unicast rd vpn export 65000:20

# Enable VPN import/export processing in each VRF
set vrf name CUSTOMER-A protocols bgp address-family ipv4-unicast import vpn
set vrf name CUSTOMER-A protocols bgp address-family ipv4-unicast export vpn
set vrf name SHARED protocols bgp address-family ipv4-unicast import vpn
set vrf name SHARED protocols bgp address-family ipv4-unicast export vpn

# Route targets for import/export
# CUSTOMER-A imports routes tagged 65000:20, exports its own with 65000:10
set vrf name CUSTOMER-A protocols bgp address-family ipv4-unicast route-target vpn import 65000:20
set vrf name CUSTOMER-A protocols bgp address-family ipv4-unicast route-target vpn export 65000:10

# SHARED exports to all customers
set vrf name SHARED protocols bgp address-family ipv4-unicast route-target vpn export 65000:20

commit
&lt;/code&gt;&lt;/pre&gt;
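&lt;p&gt;Route-target matching itself is simple tag filtering: a route is imported into a VRF when its export route-targets intersect the VRF&apos;s import set. A toy Python model of that decision:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;routes = [
    {'prefix': '10.100.0.0/16', 'rt': {'65000:20'}},  # exported by SHARED
    {'prefix': '10.2.0.0/16',   'rt': {'65000:30'}},  # exported by CUSTOMER-B
]
customer_a_import = {'65000:20'}

leaked = [r['prefix'] for r in routes
          if r['rt'].intersection(customer_a_import)]
print(leaked)  # ['10.100.0.0/16']
&lt;/code&gt;&lt;/pre&gt;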
&lt;h2&gt;Selective Route Leaking&lt;/h2&gt;
&lt;h3&gt;Leak Only Specific Routes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Only leak DNS server
set policy prefix-list DNS-ONLY rule 10 prefix 10.100.0.10/32
set policy prefix-list DNS-ONLY rule 10 action permit

# Route map for selective import
set policy route-map IMPORT-DNS rule 10 match ip address prefix-list DNS-ONLY
set policy route-map IMPORT-DNS rule 10 action permit

# Apply to import
# (Implementation depends on leaking method)

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Leak with Modified Attributes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Import but with lower preference
set policy route-map IMPORT-BACKUP rule 10 action permit
set policy route-map IMPORT-BACKUP rule 10 set local-preference 50

# Routes leaked as backup paths

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Bidirectional Leaking&lt;/h2&gt;
&lt;h3&gt;Customer Needs to Reach Shared, Shared Needs to Reach Customer&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# CUSTOMER-A → SHARED
set vrf name SHARED protocols static route 10.1.0.0/24 interface eth1 vrf CUSTOMER-A

# SHARED → CUSTOMER-A
set vrf name CUSTOMER-A protocols static route 10.100.0.0/24 interface eth2 vrf SHARED

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;With Route Targets (Symmetric)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# CUSTOMER-A
set vrf name CUSTOMER-A protocols bgp address-family ipv4-unicast route-target import 65000:20
set vrf name CUSTOMER-A protocols bgp address-family ipv4-unicast route-target export 65000:10

# SHARED
set vrf name SHARED protocols bgp address-family ipv4-unicast route-target import 65000:10
set vrf name SHARED protocols bgp address-family ipv4-unicast route-target export 65000:20

# Now bidirectional

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Common Patterns&lt;/h2&gt;
&lt;h3&gt;Pattern 1: Shared Services VRF&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;        ┌─────────────────┐
        │   VRF SHARED    │
        │  DNS, NTP, etc  │
        └────────┬────────┘
                 │ (leaked to all)
    ┌────────────┼────────────┐
    │            │            │
┌───┴───┐   ┌───┴───┐   ┌───┴───┐
│ VRF A │   │ VRF B │   │ VRF C │
└───────┘   └───────┘   └───────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Leak SHARED to all customer VRFs
# Each customer VRF imports 65000:SHARED
# SHARED exports to all
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Pattern 2: Internet Gateway&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;        ┌─────────────────┐
        │   VRF INTERNET  │
        │   (default gw)  │
        └────────┬────────┘
                 │ (default route leaked)
    ┌────────────┼────────────┐
    │            │            │
┌───┴───┐   ┌───┴───┐   ┌───┴───┐
│ VRF A │   │ VRF B │   │ VRF C │
└───────┘   └───────┘   └───────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;configure

# Internet VRF has default route
set vrf name INTERNET protocols static route 0.0.0.0/0 next-hop 203.0.113.1

# Leak default to customers
set vrf name CUSTOMER-A protocols static route 0.0.0.0/0 next-hop 10.100.0.1 vrf INTERNET

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Pattern 3: Management VRF&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;        ┌─────────────────┐
        │   VRF MGMT      │
        │  (admin access) │
        └────────┬────────┘
                 │ (limited leak)
    ┌────────────┼────────────┐
    │            │            │
┌───┴───┐   ┌───┴───┐   ┌───┴───┐
│ VRF A │   │ VRF B │   │ VRF C │
└───────┘   └───────┘   └───────┘

Management can reach all VRFs
VRFs cannot reach management
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Asymmetric: MGMT can reach customers
set vrf name MGMT protocols static route 10.1.0.0/16 interface eth1 vrf CUSTOMER-A
set vrf name MGMT protocols static route 10.2.0.0/16 interface eth2 vrf CUSTOMER-B

# But customers don&apos;t have a route to MGMT (no reverse leak)
# Caveat: replies from customer VRFs still need a return route, so in
# practice you also leak the MGMT prefix back and enforce one-way
# access with firewall rules at the leak points
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Preventing Leakage Problems&lt;/h2&gt;
&lt;h3&gt;Problem: Overlapping Addresses&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# CUSTOMER-A: 10.0.0.0/8
# CUSTOMER-B: 10.0.0.0/8  (same!)

# Can&apos;t leak between them - address collision

# Solution: NAT before leaking, or don&apos;t overlap
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Problem: Routing Loops&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# VRF A leaks to VRF B
# VRF B leaks to VRF C
# VRF C leaks to VRF A

# If routes propagate, possible loop

# Solution: Mark leaked routes, don&apos;t re-leak
&lt;/code&gt;&lt;/pre&gt;
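&lt;p&gt;One way to mark leaked routes, sketched with a hypothetical tag community &lt;code&gt;65000:999&lt;/code&gt; (where exactly the route-map attaches depends on your leaking method):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Routes already carrying the tag were leaked once - do not leak again
set policy community-list ALREADY-LEAKED rule 10 action permit
set policy community-list ALREADY-LEAKED rule 10 regex &quot;65000:999&quot;
set policy route-map MARK-LEAKED rule 5 match community community-list ALREADY-LEAKED
set policy route-map MARK-LEAKED rule 5 action deny

# Everything else is leaked and tagged on the way through
set policy route-map MARK-LEAKED rule 10 action permit
set policy route-map MARK-LEAKED rule 10 set community &quot;65000:999 additive&quot;

commit
&lt;/code&gt;&lt;/pre&gt;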
&lt;h3&gt;Problem: Unintended Connectivity&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Leaked too much - customers can now reach each other

# Solution: Use strict prefix lists
set policy prefix-list SHARED-ONLY rule 10 prefix 10.100.0.0/24
set policy prefix-list SHARED-ONLY rule 10 action permit
set policy prefix-list SHARED-ONLY rule 999 prefix 0.0.0.0/0 le 32
set policy prefix-list SHARED-ONLY rule 999 action deny

# Only leak exactly what&apos;s needed
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Verifying Route Leaking&lt;/h2&gt;
&lt;h3&gt;Check VRF Routes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show routes in specific VRF
show ip route vrf CUSTOMER-A

# Look for routes with different VRF next-hop
# 10.100.0.0/24 via 10.100.0.1, eth2 (vrf SHARED)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Check BGP VPN Routes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# If using BGP for leaking
show bgp ipv4 vpn
show bgp ipv4 vpn rd 65000:10
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Test Connectivity&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Ping across VRFs
ping 10.100.0.10 vrf CUSTOMER-A

# Traceroute across VRFs
traceroute 10.100.0.10 vrf CUSTOMER-A
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Troubleshooting&lt;/h2&gt;
&lt;h3&gt;Routes Not Appearing&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check source VRF has the route
show ip route vrf SHARED

# Check route target configuration
show vrf CUSTOMER-A
# Look for RT import/export

# Check prefix list matches
show policy prefix-list SHARED-SERVICES
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Connectivity Not Working&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Routes exist but traffic fails

# Check return path exists
show ip route vrf SHARED 10.1.0.0/24
# Shared must have route back to customer

# Check firewall allows cross-VRF
show firewall
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;BGP VPN Not Working&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check RD uniqueness
show bgp ipv4 vpn summary

# Check RT import/export match
# Export RT on one side must match import RT on other
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Best Practices&lt;/h2&gt;
&lt;h3&gt;1. Document Leaking Policy&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Route Leaking Policy

## VRF SHARED (10.100.0.0/16)
- Leaked TO: All customer VRFs
- Leaked FROM: None (customers can&apos;t reach each other via SHARED)

## VRF INTERNET (0.0.0.0/0)
- Leaked TO: Customer VRFs with internet package
- Leaked FROM: All (for return traffic)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Use Prefix Lists&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Never leak &quot;everything&quot;
# Explicit prefix lists only

set policy prefix-list LEAK-TO-CUSTOMER rule 10 prefix 10.100.0.0/24
set policy prefix-list LEAK-TO-CUSTOMER rule 10 action permit
set policy prefix-list LEAK-TO-CUSTOMER rule 20 prefix 10.100.1.0/24
set policy prefix-list LEAK-TO-CUSTOMER rule 20 action permit
set policy prefix-list LEAK-TO-CUSTOMER rule 999 prefix 0.0.0.0/0 le 32
set policy prefix-list LEAK-TO-CUSTOMER rule 999 action deny
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Consider Security&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Leaked routes bypass VRF isolation
# Firewall rules should exist at leak points

set firewall ipv4 name SHARED-TO-CUSTOMER default-action drop
set firewall ipv4 name SHARED-TO-CUSTOMER rule 10 action accept
set firewall ipv4 name SHARED-TO-CUSTOMER rule 10 destination port 53
set firewall ipv4 name SHARED-TO-CUSTOMER rule 10 protocol udp
# Only DNS allowed - everything else hits the explicit default drop
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4. Monitor Leaked Routes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Regular check that only expected routes are leaked
show ip route vrf CUSTOMER-A | grep &quot;vrf SHARED&quot;

# Alert if unexpected routes appear
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Route leaking provides controlled cross-VRF connectivity.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;VRFs isolate by default. Sometimes you need controlled exceptions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Shared DNS server&lt;/li&gt;
&lt;li&gt;Internet gateway&lt;/li&gt;
&lt;li&gt;Management access&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Route leaking gives you:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Selective connectivity (only what you allow)&lt;/li&gt;
&lt;li&gt;Clear separation (customers still isolated from each other)&lt;/li&gt;
&lt;li&gt;Flexibility (import/export per VRF)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The key word is &lt;em&gt;controlled&lt;/em&gt;. Leak only what&apos;s necessary. Use prefix lists. Verify bidirectional if needed.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;VRFs without route leaking: perfect isolation&lt;/li&gt;
&lt;li&gt;VRFs with careless leaking: no isolation&lt;/li&gt;
&lt;li&gt;VRFs with careful leaking: controlled exceptions&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Design your leaking policy before implementing. Document what goes where and why.&lt;/p&gt;
</content:encoded><category>vyos</category><category>networking</category><category>routing</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>BGP Communities: Signaling Intent Across Networks</title><link>https://ashimov.com/posts/vyos-bgp-communities/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-bgp-communities/</guid><description>Master BGP communities on VyOS. Covers standard, extended, and large communities, common use cases, community-based filtering, and why communities are the language networks speak.</description><pubDate>Fri, 09 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;BGP routes carry more than just prefix and next-hop. Communities are tags attached to routes, signaling intent to other networks. &quot;This route is for backup only.&quot; &quot;Prepend this route 3 times to peers.&quot; &quot;Don&apos;t announce this outside the region.&quot;&lt;/p&gt;
&lt;p&gt;Without communities, you&apos;d need separate sessions, manual filters, or constant coordination. With communities, you tag once, everyone who understands acts accordingly.&lt;/p&gt;
&lt;p&gt;Communities are the language networks speak to each other.&lt;/p&gt;
&lt;h2&gt;Community Types&lt;/h2&gt;
&lt;h3&gt;Standard Communities&lt;/h3&gt;
&lt;p&gt;32-bit value, formatted as &lt;code&gt;ASN:value&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;65000:100  - AS 65000, value 100
65000:1000 - AS 65000, value 1000
&lt;/code&gt;&lt;/pre&gt;
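&lt;p&gt;The &lt;code&gt;ASN:value&lt;/code&gt; notation is just a rendering of a single 32-bit integer: the ASN fills the high 16 bits and the value the low 16. A quick Python sketch of the packing (helper names are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def encode_community(asn, value):
    # Pack the 16-bit ASN into the high half and the 16-bit value into the low half
    return asn * 65536 + value

def decode_community(raw):
    # divmod splits the packed integer back into its (ASN, value) halves
    return divmod(raw, 65536)

print(encode_community(65000, 100))   # 4259840100
print(decode_community(4259840100))   # (65000, 100)
&lt;/code&gt;&lt;/pre&gt;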
&lt;h3&gt;Well-Known Communities&lt;/h3&gt;
&lt;p&gt;Predefined, universal meaning:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Community&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;no-export&lt;/td&gt;
&lt;td&gt;65535:65281&lt;/td&gt;
&lt;td&gt;Don&apos;t export outside AS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;no-advertise&lt;/td&gt;
&lt;td&gt;65535:65282&lt;/td&gt;
&lt;td&gt;Don&apos;t advertise to any peer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;local-as&lt;/td&gt;
&lt;td&gt;65535:65283&lt;/td&gt;
&lt;td&gt;Don&apos;t export outside local confederation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;no-peer&lt;/td&gt;
&lt;td&gt;65535:65284&lt;/td&gt;
&lt;td&gt;Don&apos;t advertise to EBGP peers&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
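&lt;p&gt;On VyOS, well-known communities attach like any other community; a minimal sketch, assuming your build accepts the symbolic name (the same form appears in the regional-filtering example later in this post):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Tag routes that must not leave the local AS
set policy route-map KEEP-LOCAL rule 10 action permit
set policy route-map KEEP-LOCAL rule 10 set community &quot;no-export&quot;

commit
&lt;/code&gt;&lt;/pre&gt;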
&lt;h3&gt;Extended Communities&lt;/h3&gt;
&lt;p&gt;64-bit, more structure:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;rt:65000:100     - Route Target
soo:65000:100    - Site of Origin
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Large Communities&lt;/h3&gt;
&lt;p&gt;96-bit for 4-byte ASNs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;4200000000:1:100  - Global Admin : Local Data 1 : Local Data 2
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Matching Communities&lt;/h2&gt;
&lt;h3&gt;Define Community List&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Match single community
set policy community-list BACKUP rule 10 action permit
set policy community-list BACKUP rule 10 regex &quot;65000:100&quot;

# Match any from a set (65000:100 through 65000:199)
set policy community-list CUSTOMER rule 10 action permit
set policy community-list CUSTOMER rule 10 regex &quot;65000:1[0-9][0-9]&quot;

# Match well-known
set policy community-list NO-EXPORT rule 10 action permit
set policy community-list NO-EXPORT rule 10 regex &quot;no-export&quot;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Use in Route Map&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Match routes with community
set policy route-map FILTER-IN rule 10 match community community-list BACKUP
set policy route-map FILTER-IN rule 10 action permit
set policy route-map FILTER-IN rule 10 set local-preference 50

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Setting Communities&lt;/h2&gt;
&lt;h3&gt;Add Community to Route&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Set community when advertising
set policy route-map SET-COMMUNITY rule 10 action permit
set policy route-map SET-COMMUNITY rule 10 set community &quot;65000:100&quot;

# Add community (keep existing)
set policy route-map ADD-COMMUNITY rule 10 action permit
set policy route-map ADD-COMMUNITY rule 10 set community &quot;65000:200 additive&quot;

# Set multiple communities
set policy route-map MULTI-COMMUNITY rule 10 action permit
set policy route-map MULTI-COMMUNITY rule 10 set community &quot;65000:100 65000:200&quot;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Apply to Neighbor&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Apply route-map to neighbor
set protocols bgp neighbor 10.0.0.2 address-family ipv4-unicast route-map export SET-COMMUNITY
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Common Use Cases&lt;/h2&gt;
&lt;h3&gt;Use Case 1: Traffic Engineering&lt;/h3&gt;
&lt;p&gt;Tell upstream how to prefer your routes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Community convention with ISP:
# 65000:90 = set local-pref 90 (less preferred)
# 65000:100 = set local-pref 100 (normal)
# 65000:110 = set local-pref 110 (more preferred)

configure

# Mark backup link routes as less preferred
set policy route-map TO-ISP-BACKUP rule 10 action permit
set policy route-map TO-ISP-BACKUP rule 10 set community &quot;65000:90&quot;

set protocols bgp neighbor 10.0.0.2 address-family ipv4-unicast route-map export TO-ISP-BACKUP

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Use Case 2: Prepending Control&lt;/h3&gt;
&lt;p&gt;Ask upstream to prepend your routes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Community convention:
# 65000:3001 = prepend 1x to all peers
# 65000:3002 = prepend 2x to all peers
# 65000:3003 = prepend 3x to all peers

configure

# Request 2x prepend on backup routes
set policy route-map PREPEND-REQUEST rule 10 action permit
set policy route-map PREPEND-REQUEST rule 10 set community &quot;65000:3002&quot;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Use Case 3: Regional Filtering&lt;/h3&gt;
&lt;p&gt;Announce only within region:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Community convention:
# 65000:1000 = US region
# 65000:2000 = EU region
# 65000:3000 = APAC region

configure

# Mark route as US-only
set policy route-map US-ONLY rule 10 action permit
set policy route-map US-ONLY rule 10 set community &quot;65000:1000 no-export&quot;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Use Case 4: Customer vs Peer vs Transit&lt;/h3&gt;
&lt;p&gt;Tag routes by source:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Internal convention:
# 65000:100 = customer route
# 65000:200 = peer route
# 65000:300 = transit route

configure

# Tag customer routes
set policy route-map FROM-CUSTOMER rule 10 action permit
set policy route-map FROM-CUSTOMER rule 10 set community &quot;65000:100&quot;

# Use for policy decisions
set policy route-map TO-PEER rule 10 match community community-list CUSTOMER
set policy route-map TO-PEER rule 10 action permit
# Only advertise customer routes to peers

set policy route-map TO-PEER rule 20 action deny
# Everything else (peer and transit routes) is denied - no transit via peers

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Use Case 5: Blackhole&lt;/h3&gt;
&lt;p&gt;Signal upstream to blackhole traffic:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Blackhole community: RFC 7999 reserves 65535:666, but many
# ISPs use their own ASN:666 - check with your provider

configure

# Mark route for blackholing
set policy route-map BLACKHOLE rule 10 action permit
set policy route-map BLACKHOLE rule 10 match ip address prefix-list ATTACK-PREFIX
set policy route-map BLACKHOLE rule 10 set community &quot;65000:666&quot;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Extended Communities&lt;/h2&gt;
&lt;h3&gt;Route Targets (for VRF/VPN)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Enable the VPN address-family on the global BGP instance
set protocols bgp address-family ipv4-vpn

# Per-VRF RD and route targets (same form as in the route-leaking setup)
set vrf name CUSTOMER-A protocols bgp address-family ipv4-unicast rd 65000:1
set vrf name CUSTOMER-A protocols bgp address-family ipv4-unicast route-target import 65000:1
set vrf name CUSTOMER-A protocols bgp address-family ipv4-unicast route-target export 65000:1

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Site of Origin&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Prevent routing loops in multi-homed sites
# Routes from site won&apos;t be sent back to same site

set policy route-map SET-SOO rule 10 action permit
set policy route-map SET-SOO rule 10 set extcommunity soo &quot;65000:100&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Large Communities&lt;/h2&gt;
&lt;p&gt;For networks with 4-byte ASNs or needing more structure:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Define large community list
set policy large-community-list CUSTOMER rule 10 action permit
set policy large-community-list CUSTOMER rule 10 regex &quot;4200000000:1:.*&quot;

# Set large community
set policy route-map SET-LARGE rule 10 action permit
set policy route-map SET-LARGE rule 10 set large-community &quot;4200000000:1:100&quot;

# Match large community
set policy route-map MATCH-LARGE rule 10 match large-community large-community-list CUSTOMER

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Viewing Communities&lt;/h2&gt;
&lt;h3&gt;Show Communities on Routes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show BGP routes with communities
show bgp ipv4 unicast community

# Show specific prefix with communities
show bgp ipv4 unicast 203.0.113.0/24

# Output includes:
# Community: 65000:100 65000:200

# Filter by community
show bgp ipv4 unicast community 65000:100
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Show Community Lists&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show defined community lists
show policy community-list
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Stripping Communities&lt;/h2&gt;
&lt;h3&gt;Remove Specific Communities&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Delete specific community
set policy route-map STRIP-INTERNAL rule 10 action permit
set policy route-map STRIP-INTERNAL rule 10 set community delete community-list INTERNAL

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Remove All Communities&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Remove all communities (nuclear option)
set policy route-map STRIP-ALL rule 10 action permit
set policy route-map STRIP-ALL rule 10 set community none

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Community Design Principles&lt;/h2&gt;
&lt;h3&gt;1. Document Your Scheme&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Community Scheme for AS65000

## Route Type (65000:1xx)
- 65000:100 = Customer route
- 65000:110 = Peer route
- 65000:120 = Transit route

## Traffic Engineering (65000:2xx)
- 65000:200 = Normal preference
- 65000:210 = Higher preference
- 65000:220 = Lower preference

## Regional (65000:3xx)
- 65000:300 = All regions
- 65000:310 = US only
- 65000:320 = EU only

## Action Requests (65000:4xx)
- 65000:410 = Prepend 1x
- 65000:420 = Prepend 2x
- 65000:430 = Prepend 3x
- 65000:499 = Blackhole
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Use Consistent Patterns&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Good: Predictable scheme (matching the documented ranges above)
# 65000:1xx = route type
# 65000:2xx = preference
# 65000:3xx = regional
# 65000:4xx = actions

# Bad: Random assignment
# 65000:42 = customer
# 65000:7 = blackhole
# 65000:1234 = US
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Don&apos;t Trust External Communities&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Strip customer communities on ingress
set policy route-map FROM-CUSTOMER rule 1 action permit
set policy route-map FROM-CUSTOMER rule 1 set community delete community-list ALL-INTERNAL
set policy route-map FROM-CUSTOMER rule 1 set community &quot;65000:100 additive&quot;

# Then apply customer tag
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Communities are the language networks speak to each other.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Without communities:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Manual coordination for traffic engineering&lt;/li&gt;
&lt;li&gt;Separate sessions for different policies&lt;/li&gt;
&lt;li&gt;No way to signal intent across ASes&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With communities:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Tag routes with meaning&lt;/li&gt;
&lt;li&gt;Upstream acts on tags automatically&lt;/li&gt;
&lt;li&gt;Complex policies become simple&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Design your community scheme before you need it. Document it. Use consistent numbering. Make it extensible.&lt;/p&gt;
&lt;p&gt;Communities scale your network&apos;s communication without scaling your operational overhead.&lt;/p&gt;
</content:encoded><category>vyos</category><category>bgp</category><category>networking</category><category>routing</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Network Automation with Ansible: From Manual CLI to Infrastructure as Code</title><link>https://ashimov.com/posts/network-automation-ansible/</link><guid isPermaLink="true">https://ashimov.com/posts/network-automation-ansible/</guid><description>A practical guide to automating network infrastructure using Ansible. Real examples from production environments including device configuration, backup strategies, and compliance checking.</description><pubDate>Thu, 08 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;After years of manually configuring network devices through CLI, I made the switch to network automation. The transformation wasn&apos;t just about saving time — it fundamentally changed how we approach network operations. Here&apos;s what I learned implementing Ansible-based automation across enterprise networks.&lt;/p&gt;
&lt;h2&gt;Why Network Automation Matters&lt;/h2&gt;
&lt;p&gt;The traditional approach to network management has serious limitations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Configuration drift&lt;/strong&gt;: Manual changes accumulate inconsistencies&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Human error&lt;/strong&gt;: Typos in CLI commands cause outages&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Audit challenges&lt;/strong&gt;: No clear record of who changed what and when&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Slow disaster recovery&lt;/strong&gt;: Rebuilding configurations from scratch takes hours&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Automation addresses all of these while enabling practices that were previously impractical.&lt;/p&gt;
&lt;h2&gt;Starting with Ansible for Networks&lt;/h2&gt;
&lt;p&gt;Ansible works differently for network devices than for servers. Most network gear doesn&apos;t run Python, so Ansible connects via SSH and sends commands directly. This makes it accessible even for legacy equipment.&lt;/p&gt;
&lt;h3&gt;Basic Inventory Structure&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# inventory/hosts.ini
[core_switches]
core-sw-01 ansible_host=10.0.1.1
core-sw-02 ansible_host=10.0.1.2

[access_switches]
# hostname range expands to access-sw-01 ... access-sw-24;
# per-host IPs come from DNS or host_vars (Jinja loops are not valid in inventory)
access-sw-[01:24]

[firewalls]
fw-primary ansible_host=10.0.0.1
fw-secondary ansible_host=10.0.0.2

[all:vars]
ansible_network_os=ios
ansible_connection=network_cli
ansible_user=automation
ansible_ssh_private_key_file=~/.ssh/network_automation
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Group Variables for Configuration Standards&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# group_vars/all.yml
---
ntp_servers:
  - 10.0.0.10
  - 10.0.0.11

dns_servers:
  - 10.0.0.20
  - 10.0.0.21

syslog_servers:
  - 10.0.0.30

snmp_community: &quot;{{ vault_snmp_community }}&quot;

banner_motd: |
  ********************************************
  *  AUTHORIZED ACCESS ONLY                  *
  *  All activity is monitored and logged    *
  ********************************************
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Practical Playbook Examples&lt;/h2&gt;
&lt;h3&gt;Configuration Backup&lt;/h3&gt;
&lt;p&gt;This is often the first automation task — backing up all device configurations nightly:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# playbooks/backup_configs.yml
---
- name: Backup Network Device Configurations
  hosts: all
  gather_facts: no

  vars:
    backup_root: &quot;/backup/network/{{ inventory_hostname }}&quot;

  tasks:
    - name: Create backup directory
      delegate_to: localhost
      file:
        path: &quot;{{ backup_root }}&quot;
        state: directory

    - name: Get current configuration
      ios_command:
        commands: show running-config
      register: config_output

    - name: Save configuration to file
      delegate_to: localhost
      copy:
        content: &quot;{{ config_output.stdout[0] }}&quot;
        # strftime works without facts; ansible_date_time is undefined when gather_facts is off
        dest: &quot;{{ backup_root }}/{{ inventory_hostname }}_{{ &apos;%Y-%m-%d&apos; | strftime }}.cfg&quot;

    - name: Compare with previous backup
      delegate_to: localhost
      shell: |
        latest=$(ls -t {{ backup_root }}/*.cfg | head -2 | tail -1)
        current={{ backup_root }}/{{ inventory_hostname }}_{{ &apos;%Y-%m-%d&apos; | strftime }}.cfg
        if [ -f &quot;$latest&quot; ] &amp;amp;&amp;amp; [ &quot;$latest&quot; != &quot;$current&quot; ]; then
          diff -u &quot;$latest&quot; &quot;$current&quot; || true
        fi
      register: config_diff
      changed_when: config_diff.stdout != &quot;&quot;

    - name: Send notification if config changed
      delegate_to: localhost
      mail:
        to: netops@company.com
        subject: &quot;Config change detected: {{ inventory_hostname }}&quot;
        body: &quot;{{ config_diff.stdout }}&quot;
      when: config_diff.changed
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Standardizing NTP Configuration&lt;/h3&gt;
&lt;p&gt;Ensuring all devices use the correct time servers:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# playbooks/configure_ntp.yml
---
- name: Standardize NTP Configuration
  hosts: all
  gather_facts: no

  tasks:
    - name: Remove existing NTP servers
      ios_config:
        lines:
          - no ntp server {{ item }}
      loop: &quot;{{ existing_ntp_servers | default([]) }}&quot;

    - name: Configure standard NTP servers
      ios_config:
        lines:
          - ntp server {{ item }} prefer
      loop: &quot;{{ ntp_servers }}&quot;

    - name: Set timezone
      ios_config:
        lines:
          - clock timezone UTC 0

    - name: Verify NTP synchronization
      ios_command:
        commands: show ntp status
      register: ntp_status

    - name: Report NTP status
      debug:
        msg: &quot;NTP synchronized: {{ &apos;Clock is synchronized&apos; in ntp_status.stdout[0] }}&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;VLAN Deployment Across Multiple Switches&lt;/h3&gt;
&lt;p&gt;When adding a new VLAN to multiple devices:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# playbooks/deploy_vlan.yml
---
- name: Deploy VLAN to Access Switches
  hosts: access_switches
  gather_facts: no

  vars_prompt:
    - name: vlan_id
      prompt: &quot;Enter VLAN ID&quot;
      private: no
    - name: vlan_name
      prompt: &quot;Enter VLAN name&quot;
      private: no

  tasks:
    - name: Create VLAN
      ios_vlans:
        config:
          - vlan_id: &quot;{{ vlan_id | int }}&quot;
            name: &quot;{{ vlan_name }}&quot;
            state: active
        state: merged

    - name: Verify VLAN creation
      ios_command:
        commands: &quot;show vlan id {{ vlan_id }}&quot;
      register: vlan_check

    - name: Display result
      debug:
        var: vlan_check.stdout_lines
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Compliance Checking&lt;/h2&gt;
&lt;p&gt;Beyond configuration, Ansible can verify devices meet security standards:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# playbooks/compliance_check.yml
---
- name: Security Compliance Audit
  hosts: all
  gather_facts: no

  vars:
    compliance_report: []

  tasks:
    - name: Check SSH version 2
      ios_command:
        commands: show ip ssh
      register: ssh_config

    - name: Verify SSHv2 enabled
      set_fact:
        compliance_report: &quot;{{ compliance_report + [{&apos;check&apos;: &apos;SSH Version&apos;, &apos;status&apos;: &apos;PASS&apos; if &apos;SSH version 2&apos; in ssh_config.stdout[0] else &apos;FAIL&apos;}] }}&quot;

    - name: Check password encryption
      ios_command:
        commands: show running-config | include service password
      register: password_config

    - name: Verify password encryption
      set_fact:
        compliance_report: &quot;{{ compliance_report + [{&apos;check&apos;: &apos;Password Encryption&apos;, &apos;status&apos;: &apos;PASS&apos; if &apos;service password-encryption&apos; in password_config.stdout[0] else &apos;FAIL&apos;}] }}&quot;

    - name: Check login banner
      ios_command:
        commands: show running-config | section banner
      register: banner_config

    - name: Verify login banner exists
      set_fact:
        compliance_report: &quot;{{ compliance_report + [{&apos;check&apos;: &apos;Login Banner&apos;, &apos;status&apos;: &apos;PASS&apos; if banner_config.stdout[0] | length &amp;gt; 10 else &apos;FAIL&apos;}] }}&quot;

    - name: Check unused ports disabled
      ios_command:
        commands: show interfaces status | include notconnect
      register: unused_ports

    - name: Generate compliance report
      delegate_to: localhost
      template:
        src: compliance_report.j2
        dest: &quot;/reports/compliance/{{ inventory_hostname }}_{{ &apos;%Y-%m-%d&apos; | strftime }}.html&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Handling Different Vendors&lt;/h2&gt;
&lt;p&gt;Real networks have equipment from multiple vendors. Ansible handles this with platform-specific modules:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# playbooks/multi_vendor_backup.yml
---
- name: Multi-Vendor Configuration Backup
  hosts: all
  gather_facts: no

  tasks:
    - name: Backup Cisco IOS
      ios_command:
        commands: show running-config
      register: ios_out
      when: ansible_network_os == &apos;ios&apos;

    - name: Backup Cisco NX-OS
      nxos_command:
        commands: show running-config
      register: nxos_out
      when: ansible_network_os == &apos;nxos&apos;

    - name: Backup Juniper JunOS
      junos_command:
        commands: show configuration
      register: junos_out
      when: ansible_network_os == &apos;junos&apos;

    - name: Backup Arista EOS
      eos_command:
        commands: show running-config
      register: eos_out
      when: ansible_network_os == &apos;eos&apos;

    # register fires even for skipped tasks, so sharing &apos;register: config&apos;
    # across the four tasks would let a skipped task clobber the real output;
    # pick whichever platform task actually produced stdout
    - name: Select platform output
      set_fact:
        config: &quot;{{ [ios_out, nxos_out, junos_out, eos_out] | selectattr(&apos;stdout&apos;, &apos;defined&apos;) | first }}&quot;

    - name: Save configuration
      delegate_to: localhost
      copy:
        content: &quot;{{ config.stdout[0] }}&quot;
        dest: &quot;/backup/{{ inventory_hostname }}.cfg&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Error Handling and Rollback&lt;/h2&gt;
&lt;p&gt;Network changes need careful error handling:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# playbooks/safe_config_change.yml
---
- name: Safe Configuration Change with Rollback
  hosts: &quot;{{ target_device }}&quot;
  gather_facts: no
  serial: 1

  tasks:
    - name: Backup current config
      ios_command:
        commands: show running-config
      register: backup_config

    - name: Save backup locally
      delegate_to: localhost
      copy:
        content: &quot;{{ backup_config.stdout[0] }}&quot;
        dest: &quot;/tmp/{{ inventory_hostname }}_rollback.cfg&quot;

    - name: Apply configuration changes
      ios_config:
        src: &quot;{{ config_template }}&quot;
        save_when: never
      register: config_result

    - name: Verify connectivity after change
      wait_for:
        host: &quot;{{ ansible_host }}&quot;
        port: 22
        timeout: 30
      delegate_to: localhost
      register: connectivity
      ignore_errors: yes

    - name: Rollback if connectivity lost
      block:
        - name: Wait for device recovery
          wait_for:
            host: &quot;{{ ansible_host }}&quot;
            port: 22
            timeout: 300
          delegate_to: localhost

        - name: Restore previous configuration
          ios_config:
            src: &quot;/tmp/{{ inventory_hostname }}_rollback.cfg&quot;

        - name: Notify about rollback
          mail:
            to: netops@company.com
            subject: &quot;ROLLBACK: {{ inventory_hostname }}&quot;
            body: &quot;Configuration change failed and was rolled back&quot;
          delegate_to: localhost
      when: connectivity is failed

    - name: Save configuration if successful
      ios_command:
        commands: write memory
      when: connectivity is succeeded
&lt;/code&gt;&lt;/pre&gt;
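&lt;p&gt;Because the play targets &lt;code&gt;{{ target_device }}&lt;/code&gt; and reads &lt;code&gt;{{ config_template }}&lt;/code&gt;, both are supplied at run time; a typical invocation (the template path is illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ansible-playbook playbooks/safe_config_change.yml \
  -e target_device=core-sw-01 \
  -e config_template=templates/ntp_update.cfg
&lt;/code&gt;&lt;/pre&gt;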
&lt;h2&gt;Integration with CI/CD&lt;/h2&gt;
&lt;p&gt;Network changes can flow through the same CI/CD pipelines as application code:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# .gitlab-ci.yml
stages:
  - validate
  - test
  - deploy

validate_syntax:
  stage: validate
  script:
    - ansible-playbook --syntax-check playbooks/*.yml
    - ansible-lint playbooks/

test_in_lab:
  stage: test
  script:
    - ansible-playbook -i inventory/lab playbooks/deploy_changes.yml
    - ansible-playbook -i inventory/lab playbooks/run_tests.yml
  environment:
    name: lab

deploy_production:
  stage: deploy
  script:
    - ansible-playbook -i inventory/production playbooks/deploy_changes.yml
  environment:
    name: production
  when: manual
  only:
    - main
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Lessons Learned&lt;/h2&gt;
&lt;p&gt;After implementing automation across several networks:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Start small&lt;/strong&gt;: Begin with read-only tasks like backups and compliance checks. Build confidence before making changes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Version control everything&lt;/strong&gt;: Playbooks, inventory, variables — all in Git. This provides audit trail and enables code review for network changes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Test in lab first&lt;/strong&gt;: Even simple playbooks can have unexpected effects. A lab environment (even virtual) is essential.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use check mode&lt;/strong&gt;: Always run with &lt;code&gt;--check&lt;/code&gt; first in production to see what would change.&lt;/p&gt;
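&lt;p&gt;In practice, that dry run can look like this (the playbook and inventory paths echo the CI/CD example below; the device name &lt;code&gt;core-sw-01&lt;/code&gt; is a placeholder):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Preview what would change, without touching devices
ansible-playbook -i inventory/production playbooks/deploy_changes.yml --check --diff

# Then run for real, against a single device first
ansible-playbook -i inventory/production playbooks/deploy_changes.yml --limit core-sw-01
&lt;/code&gt;&lt;/pre&gt;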
&lt;p&gt;&lt;strong&gt;Document the manual fallback&lt;/strong&gt;: Automation will fail eventually. Document how to perform critical tasks manually.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Monitor playbook execution&lt;/strong&gt;: Log all runs, track success rates, alert on failures.&lt;/p&gt;
&lt;h2&gt;The Bigger Picture&lt;/h2&gt;
&lt;p&gt;Network automation isn&apos;t just about efficiency — it&apos;s about treating network infrastructure with the same rigor as application code. When configurations are versioned, tested, and deployed through pipelines, networks become more reliable and easier to manage.&lt;/p&gt;
&lt;p&gt;The shift from &quot;network engineer who uses CLI&quot; to &quot;network engineer who writes code&quot; isn&apos;t always comfortable, but it&apos;s increasingly necessary. The skills transfer: understanding of networking fundamentals remains essential; you&apos;re just expressing that knowledge differently.&lt;/p&gt;
&lt;p&gt;Start with backups. Move to compliance checking. Then tackle configuration standardization. Each step builds confidence for the next. Before long, you&apos;ll wonder how you managed without it.&lt;/p&gt;
</content:encoded><category>automation</category><category>ansible</category><category>networking</category><category>devops</category><category>infrastructure</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Building Reliable Infrastructure: Lessons from 15 Years in Operations</title><link>https://ashimov.com/posts/Building%20Reliable%20Infrastructure:%20Lessons%20from%2015%20Years%20in%20Operations/</link><guid isPermaLink="true">https://ashimov.com/posts/Building%20Reliable%20Infrastructure:%20Lessons%20from%2015%20Years%20in%20Operations/</guid><description>Key principles for building infrastructure that survives failures, scales gracefully, and lets you sleep at night. Real lessons from production environments.</description><pubDate>Wed, 07 Jan 2026 00:00:00 GMT</pubDate><content:encoded>
&lt;p&gt;After 15 years of managing production infrastructure — from small business servers to enterprise payment systems with 99.99% uptime requirements — I&apos;ve learned that reliability isn&apos;t about preventing failures. It&apos;s about designing systems that handle failures gracefully.&lt;/p&gt;
&lt;h2&gt;The Three Pillars of Reliable Infrastructure&lt;/h2&gt;
&lt;p&gt;Every reliable system I&apos;ve built or maintained rests on three pillars:&lt;/p&gt;
&lt;h3&gt;1. Eliminate Single Points of Failure&lt;/h3&gt;
&lt;p&gt;The question isn&apos;t &quot;will this component fail?&quot; but &quot;when it fails, what happens?&quot;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Network layer:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Dual uplinks with BGP or VRRP failover&lt;/li&gt;
&lt;li&gt;Redundant switches in stack or MLAG configuration&lt;/li&gt;
&lt;li&gt;Multiple paths to critical services&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Compute layer:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;VM anti-affinity rules across hypervisor hosts&lt;/li&gt;
&lt;li&gt;Database replicas in different failure domains&lt;/li&gt;
&lt;li&gt;Load balancers in active-passive or active-active pairs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Storage layer:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;RAID for local storage (RAID 10 for databases)&lt;/li&gt;
&lt;li&gt;Replicated storage backends&lt;/li&gt;
&lt;li&gt;Off-site backups with tested recovery procedures&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Common Mistake:&lt;/strong&gt; Redundancy without automatic failover is just expensive complexity. If someone needs to SSH in at 3 AM to switch traffic, your redundancy has failed.&lt;/p&gt;
&lt;h3&gt;2. Monitor What Matters&lt;/h3&gt;
&lt;p&gt;I&apos;ve seen monitoring setups with 10,000 metrics where teams still miss critical outages. The problem isn&apos;t lack of data — it&apos;s lack of focus.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Effective monitoring hierarchy:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Business metrics (revenue, user transactions)
    ↓
Service metrics (latency, error rates, throughput)
    ↓
Infrastructure metrics (CPU, memory, disk, network)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Start from the top. If business metrics are healthy, infrastructure alerts can wait. If business metrics drop, you need to know immediately — even if all infrastructure metrics look green.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Key principles:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Alert on symptoms, not causes&lt;/li&gt;
&lt;li&gt;Every alert should be actionable&lt;/li&gt;
&lt;li&gt;If you ignore an alert twice, fix or delete it&lt;/li&gt;
&lt;li&gt;On-call should get fewer than 5 pages per week&lt;/li&gt;
&lt;/ul&gt;
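&lt;p&gt;As an illustration, a symptom-level alert in Prometheus might look like this (the metric names are assumptions about your instrumentation, not a prescription):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;groups:
  - name: symptoms
    rules:
      - alert: HighErrorRate
        # Symptom: users are seeing errors - regardless of which component caused them
        expr: sum(rate(http_requests_total{status=~&quot;5..&quot;}[5m]))
              / sum(rate(http_requests_total[5m])) &gt; 0.01
        for: 5m
        labels:
          severity: page
        annotations:
          summary: &quot;Error rate above 1% for 5 minutes&quot;
&lt;/code&gt;&lt;/pre&gt;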
&lt;h3&gt;3. Standardize Everything&lt;/h3&gt;
&lt;p&gt;The most reliable environments I&apos;ve managed weren&apos;t the most sophisticated — they were the most boring. Same OS version everywhere. Same configuration management. Same deployment process.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What to standardize:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Base OS images with security hardening&lt;/li&gt;
&lt;li&gt;Network configurations (use templates)&lt;/li&gt;
&lt;li&gt;Monitoring and logging agents&lt;/li&gt;
&lt;li&gt;Backup schedules and retention&lt;/li&gt;
&lt;li&gt;Patch management windows&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Benefits:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Faster troubleshooting (you&apos;ve seen this before)&lt;/li&gt;
&lt;li&gt;Easier automation (one playbook fits all)&lt;/li&gt;
&lt;li&gt;Reduced cognitive load (less to remember)&lt;/li&gt;
&lt;li&gt;Simpler compliance (consistent baselines)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Real-World Example: Payment System Architecture&lt;/h2&gt;
&lt;p&gt;One system I helped design processes financial transactions across two data centers. Here&apos;s what makes it reliable:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Primary&lt;/th&gt;
&lt;th&gt;Failover&lt;/th&gt;
&lt;th&gt;RTO&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Database&lt;/td&gt;
&lt;td&gt;DC1 (sync replica)&lt;/td&gt;
&lt;td&gt;DC2 (async replica)&lt;/td&gt;
&lt;td&gt;&amp;lt; 30s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Application&lt;/td&gt;
&lt;td&gt;Active-active&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Load Balancer&lt;/td&gt;
&lt;td&gt;DC1&lt;/td&gt;
&lt;td&gt;DC2 (DNS failover)&lt;/td&gt;
&lt;td&gt;&amp;lt; 60s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Network&lt;/td&gt;
&lt;td&gt;ISP A + ISP B&lt;/td&gt;
&lt;td&gt;BGP rerouting&lt;/td&gt;
&lt;td&gt;&amp;lt; 10s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Key design decisions:&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Synchronous replication&lt;/strong&gt; for transactions (data consistency over availability)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Asynchronous replication&lt;/strong&gt; for reporting databases (availability over consistency)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Health checks every 5 seconds&lt;/strong&gt; with 3 failures before failover&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automated failover&lt;/strong&gt; for network and load balancers&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Manual failover&lt;/strong&gt; for database (intentional — prevents split-brain)&lt;/li&gt;
&lt;/ol&gt;
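&lt;p&gt;The health-check cadence in point 3 maps directly onto load-balancer configuration. In HAProxy terms it might look like this (backend and server names are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;backend payments
    # probe every 5 seconds; mark down after 3 consecutive failures
    server app-dc1 10.0.1.10:8443 check inter 5s fall 3 rise 2
    server app-dc2 10.0.2.10:8443 check inter 5s fall 3 rise 2
&lt;/code&gt;&lt;/pre&gt;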
&lt;p&gt;&lt;strong&gt;The 99.99% Reality:&lt;/strong&gt; 99.99% uptime means less than 53 minutes of downtime per year. That&apos;s about 4 minutes per month. Every design decision must account for this budget.&lt;/p&gt;
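&lt;p&gt;The budget arithmetic is worth writing out:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;525,600 minutes/year × (1 − 0.9999) ≈ 52.6 minutes/year
52.6 minutes/year ÷ 12 ≈ 4.4 minutes/month
&lt;/code&gt;&lt;/pre&gt;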
&lt;h2&gt;Operational Practices That Actually Work&lt;/h2&gt;
&lt;p&gt;Beyond architecture, these practices have saved me countless times:&lt;/p&gt;
&lt;h3&gt;Change Management&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;No changes on Fridays (or before holidays)&lt;/li&gt;
&lt;li&gt;All changes documented and reversible&lt;/li&gt;
&lt;li&gt;Staged rollouts: dev → staging → canary → production&lt;/li&gt;
&lt;li&gt;Post-change monitoring period (15-30 minutes minimum)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Incident Response&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Clear escalation paths documented&lt;/li&gt;
&lt;li&gt;Runbooks for common failures (not just &quot;restart the service&quot;)&lt;/li&gt;
&lt;li&gt;Blameless postmortems focused on systemic improvements&lt;/li&gt;
&lt;li&gt;Regular disaster recovery drills (at least quarterly)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Capacity Planning&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Track growth trends monthly&lt;/li&gt;
&lt;li&gt;Provision for 2x expected peak load&lt;/li&gt;
&lt;li&gt;Set alerts at 70% capacity (time to plan expansion)&lt;/li&gt;
&lt;li&gt;Review capacity quarterly with business stakeholders&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;What I&apos;ve Learned&lt;/h2&gt;
&lt;p&gt;If I could summarize 15 years into a few principles:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Simple systems fail less&lt;/strong&gt; — Every component is a potential failure point&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automation saves you at 3 AM&lt;/strong&gt; — If it&apos;s not automated, it won&apos;t happen correctly under pressure&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Documentation is for future you&lt;/strong&gt; — Write it like you&apos;ll be on vacation when things break&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Test your backups&lt;/strong&gt; — Untested backups are just hopes&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Learn from every incident&lt;/strong&gt; — The same failure twice is an organizational failure&lt;/li&gt;
&lt;/ol&gt;
&lt;hr /&gt;
&lt;p&gt;This is the first post on my new blog. I&apos;ll be sharing more operational knowledge — monitoring setups, network automation, security practices, and lessons from real incidents.&lt;/p&gt;
&lt;p&gt;Questions or topics you&apos;d like me to cover? Reach out on &lt;a href=&quot;https://www.linkedin.com/in/berik-ashimov/&quot;&gt;LinkedIn&lt;/a&gt; or &lt;a href=&quot;https://t.me/B3r1k&quot;&gt;Telegram&lt;/a&gt;.&lt;/p&gt;
</content:encoded><category>infrastructure</category><category>reliability</category><category>operations</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Graceful Restart: Maintaining Forwarding During Protocol Restarts</title><link>https://ashimov.com/posts/vyos-graceful-restart/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-graceful-restart/</guid><description>Configure OSPF and BGP graceful restart on VyOS. Covers GR mechanics, helper mode, restart timers, and why graceful restart prevents traffic loss during maintenance.</description><pubDate>Tue, 06 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;You need to restart the routing daemon. Maybe for an upgrade, maybe for a config reload. Normal behavior: neighbors detect the restart, withdraw routes, traffic reroutes. Convergence takes seconds to minutes.&lt;/p&gt;
&lt;p&gt;Graceful restart keeps forwarding while the control plane restarts. Neighbors know you&apos;re restarting (not dead) and keep routes. Data plane continues forwarding. After restart, routing state resynchronizes. No traffic loss.&lt;/p&gt;
&lt;p&gt;Graceful restart prevents traffic loss during planned maintenance.&lt;/p&gt;
&lt;h2&gt;How Graceful Restart Works&lt;/h2&gt;
&lt;h3&gt;Normal Restart (Without GR)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;1. Router A routing daemon restarts
2. Router B detects adjacency down
3. Router B withdraws all routes from A
4. Traffic reconverges to alternate paths
5. Router A comes back up
6. Adjacency re-established
7. Routes re-learned
8. Traffic returns to original path

Impact: Minutes of reconvergence, possible blackhole
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Graceful Restart&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;1. Router A signals &quot;entering graceful restart&quot;
2. Router A daemon restarts, forwarding plane continues
3. Router B (helper) keeps routes, marks them &quot;stale&quot;
4. Router A comes back up quickly
5. Router A re-establishes adjacency, refreshes routes
6. Router B removes &quot;stale&quot; flag
7. No route withdrawal, no reconvergence

Impact: Near-zero traffic disruption
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;BGP Graceful Restart&lt;/h2&gt;
&lt;h3&gt;Basic Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Enable graceful restart for BGP
set protocols bgp parameters graceful-restart

# Optional: Set restart time (how long helper waits)
set protocols bgp parameters graceful-restart restart-time 120

# Optional: Set stalepath time (how long to keep stale routes)
set protocols bgp parameters graceful-restart stalepath-time 360

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Per-Neighbor Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Enable/disable per neighbor
set protocols bgp neighbor 10.0.0.2 capability graceful-restart

# Some neighbors might not support GR
# Disable for specific neighbor:
set protocols bgp neighbor 10.0.0.3 capability graceful-restart disable
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;BGP GR Timers&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Timer&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;Range&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;restart-time&lt;/td&gt;
&lt;td&gt;Time helper waits for restart&lt;/td&gt;
&lt;td&gt;120s&lt;/td&gt;
&lt;td&gt;1-4095s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;stalepath-time&lt;/td&gt;
&lt;td&gt;Time to keep stale routes&lt;/td&gt;
&lt;td&gt;360s&lt;/td&gt;
&lt;td&gt;1-4095s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;pre&gt;&lt;code&gt;# Adjust timers
set protocols bgp parameters graceful-restart restart-time 180
set protocols bgp parameters graceful-restart stalepath-time 600
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Verify BGP GR&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check neighbor capabilities
show bgp neighbors 10.0.0.2

# Look for:
# Graceful Restart Capability: advertised and received
# Remote Restart timer is 120 seconds
# Address families by peer:
#   IPv4 Unicast(Preserved)

# Check current GR state
show bgp neighbors 10.0.0.2 graceful-restart
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;OSPF Graceful Restart&lt;/h2&gt;
&lt;h3&gt;Basic Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Enable OSPF graceful restart
set protocols ospf graceful-restart

# Set grace period
set protocols ospf graceful-restart grace-period 180

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;OSPF GR Helper Mode&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Helper mode (support other routers restarting)
set protocols ospf graceful-restart helper enable

# Can restrict helper mode
set protocols ospf graceful-restart helper strict-lsa-checking
# If LSA changes during restart, exit GR (safer)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;OSPF GR Timers&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Timer&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;grace-period&lt;/td&gt;
&lt;td&gt;Time to complete restart&lt;/td&gt;
&lt;td&gt;180s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;pre&gt;&lt;code&gt;# Adjust grace period
set protocols ospf graceful-restart grace-period 300
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Verify OSPF GR&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check OSPF graceful restart status
show ip ospf graceful-restart

# Check neighbor state during restart
show ip ospf neighbor

# During GR, neighbor might show special state
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Testing Graceful Restart&lt;/h2&gt;
&lt;h3&gt;Test BGP GR&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Terminal 1: Watch BGP neighbor
watch -n 1 &apos;vtysh -c &quot;show bgp neighbors 10.0.0.2&quot;&apos;

# Terminal 2: Restart BGP
systemctl restart frr

# Observe:
# - Neighbor should stay established (or show &quot;Restart&quot;)
# - Routes should not be withdrawn
# - Quick re-establishment
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Test OSPF GR&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Terminal 1: Watch OSPF neighbor
watch -n 1 &apos;vtysh -c &quot;show ip ospf neighbor&quot;&apos;

# Terminal 2: Restart OSPF
systemctl restart frr

# Observe:
# - Neighbor should not go Down
# - Routes should persist
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Verify Forwarding Continues&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# From another host, continuous ping through router
ping -i 0.1 destination-through-router

# During restart:
# Without GR: Packet loss during convergence
# With GR: Zero or minimal packet loss
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Long-Lived Graceful Restart (LLGR)&lt;/h2&gt;
&lt;p&gt;For BGP, LLGR extends stale route retention:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Enable LLGR
set protocols bgp parameters graceful-restart long-lived

# Set LLGR stale time (much longer than regular)
set protocols bgp parameters graceful-restart long-lived stale-time 86400

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;LLGR keeps routes even longer, tagged with the LLGR_STALE community so peers treat them with lower preference. Useful for edge cases where a restart takes a very long time.&lt;/p&gt;
&lt;h2&gt;Graceful Restart vs BFD&lt;/h2&gt;
&lt;p&gt;They serve different purposes:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Graceful Restart&lt;/th&gt;
&lt;th&gt;BFD&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Purpose&lt;/td&gt;
&lt;td&gt;Survive planned restarts&lt;/td&gt;
&lt;td&gt;Detect failures fast&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trigger&lt;/td&gt;
&lt;td&gt;Control plane restart&lt;/td&gt;
&lt;td&gt;Link/peer failure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Response&lt;/td&gt;
&lt;td&gt;Keep routes&lt;/td&gt;
&lt;td&gt;Withdraw routes fast&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Use together&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;pre&gt;&lt;code&gt;# Use both!
# BFD: Detect actual failures quickly
# GR: Survive planned restarts

set protocols bgp neighbor 10.0.0.2 bfd
set protocols bgp parameters graceful-restart
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;When GR Doesn&apos;t Help&lt;/h2&gt;
&lt;h3&gt;Unplanned Failures&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Router crashes (not graceful)
# Forwarding plane also fails
# GR signal never sent

# Solution: BFD detects quickly, traffic reroutes
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Forwarding Plane Restart&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# If forwarding (kernel/hardware) restarts:
# GR won&apos;t help - traffic still disrupted

# GR only helps when:
# - Control plane (FRR) restarts
# - Forwarding (kernel routes) continues
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Configuration Changes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Major config change might require route refresh anyway
# GR preserves old routes, but new config applies

# Be careful: GR might keep stale config briefly
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Troubleshooting GR&lt;/h2&gt;
&lt;h3&gt;GR Not Working&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check if GR capability exchanged
show bgp neighbors 10.0.0.2 | grep -i graceful

# If &quot;not received&quot;:
# - Peer doesn&apos;t support GR
# - Peer has GR disabled

# Check OSPF GR status
show ip ospf graceful-restart

# If disabled, check config:
show configuration commands | grep graceful
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Routes Withdrawn Anyway&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Possible causes:
# 1. Restart took too long (exceeded timer)
# 2. Helper router cleared routes
# 3. GR not properly negotiated

# Check timers
show bgp neighbors 10.0.0.2 | grep -i timer
show bgp neighbors 10.0.0.2 | grep -i restart

# Increase restart-time if needed
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Helper Not Preserving Routes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check helper configuration
show configuration commands | grep helper

# OSPF might need explicit helper mode
set protocols ospf graceful-restart helper enable
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Best Practices&lt;/h2&gt;
&lt;h3&gt;1. Enable on All Routers&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# GR is peer-to-peer negotiation
# Both sides should have it enabled

# Without GR on peer:
# - Your restart withdraws routes from peer
# - Peer&apos;s restart withdraws routes from you
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Test Before Production&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Test GR in lab/staging
# Verify:
# - Capabilities exchanged
# - Routes preserved during restart
# - Forwarding continues
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Monitor During Maintenance&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# During planned restart, monitor:
show bgp summary
show ip ospf neighbor

# Watch for state changes
# Verify quick re-establishment
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4. Tune Timers for Your Environment&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Fast restart (SSD, modern hardware)
set protocols bgp parameters graceful-restart restart-time 60

# Slow restart (older hardware, large config)
set protocols bgp parameters graceful-restart restart-time 300
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Configuration Summary&lt;/h2&gt;
&lt;h3&gt;BGP Graceful Restart&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Basic GR
set protocols bgp parameters graceful-restart

# Timers
set protocols bgp parameters graceful-restart restart-time 120
set protocols bgp parameters graceful-restart stalepath-time 360

# Per-neighbor (optional)
set protocols bgp neighbor 10.0.0.2 capability graceful-restart

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;OSPF Graceful Restart&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Basic GR
set protocols ospf graceful-restart
set protocols ospf graceful-restart grace-period 180

# Helper mode
set protocols ospf graceful-restart helper enable

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Graceful restart prevents traffic loss during planned maintenance.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Without GR:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Daemon restart = all routes withdrawn&lt;/li&gt;
&lt;li&gt;Traffic reconverges (seconds to minutes)&lt;/li&gt;
&lt;li&gt;Users see disruption&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With GR:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Daemon restart signaled to neighbors&lt;/li&gt;
&lt;li&gt;Neighbors keep routes (marked stale)&lt;/li&gt;
&lt;li&gt;Forwarding continues&lt;/li&gt;
&lt;li&gt;Daemon comes back, routes refreshed&lt;/li&gt;
&lt;li&gt;Users notice nothing&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Every production router should have graceful restart enabled. It&apos;s free insurance for maintenance windows.&lt;/p&gt;
&lt;p&gt;The 30 seconds you spend configuring GR saves minutes of disruption every time you restart the routing daemon.&lt;/p&gt;
</content:encoded><category>vyos</category><category>bgp</category><category>ha</category><category>networking</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>BFD: Fast Failover Detection for Routing Protocols</title><link>https://ashimov.com/posts/vyos-bfd/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-bfd/</guid><description>Implement BFD on VyOS for sub-second failure detection. Covers BFD timers, integration with BGP and OSPF, multihop BFD, and why routing protocol keepalives are too slow.</description><pubDate>Fri, 02 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;BGP default keepalive: 60 seconds. Hold time: 180 seconds. That&apos;s 3 minutes before your router notices a peer is dead. Three minutes of blackholing traffic.&lt;/p&gt;
&lt;p&gt;OSPF default dead interval: 40 seconds. Better, but still 40 seconds of packets going nowhere.&lt;/p&gt;
&lt;p&gt;BFD (Bidirectional Forwarding Detection) runs alongside routing protocols, detecting failures in milliseconds. When BFD sees the neighbor is dead, it tells BGP/OSPF immediately. Failover happens in under a second.&lt;/p&gt;
&lt;p&gt;Routing protocol keepalives are too slow. BFD fixes this.&lt;/p&gt;
&lt;h2&gt;How BFD Works&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;Normal state:
Router A ←→ Router B
BFD packets every 300ms
Both routers: &quot;Peer is alive&quot;

Failure:
Router A → X → Router B (link fails)
Router A: No BFD response for 900ms (3 × 300ms)
Router A: &quot;Peer is dead, notify BGP/OSPF&quot;
BGP/OSPF: Immediately withdraw routes
Total detection time: ~1 second
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;BFD is protocol-independent. It just says &quot;neighbor reachable&quot; or &quot;neighbor unreachable.&quot; Routing protocols react to this signal.&lt;/p&gt;
&lt;h2&gt;BFD Timers&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Typical Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;interval&lt;/td&gt;
&lt;td&gt;Time between BFD packets&lt;/td&gt;
&lt;td&gt;300ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;min-rx&lt;/td&gt;
&lt;td&gt;Minimum receive interval&lt;/td&gt;
&lt;td&gt;300ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;multiplier&lt;/td&gt;
&lt;td&gt;Missed packets before failure&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Detection time = interval × multiplier = 300ms × 3 = 900ms&lt;/p&gt;
&lt;h2&gt;Basic BFD Configuration&lt;/h2&gt;
&lt;h3&gt;Enable BFD Globally&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Define BFD profile
set protocols bfd profile FAST interval 300
set protocols bfd profile FAST min-rx 300
set protocols bfd profile FAST multiplier 3

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;BFD with BGP&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Configure BGP neighbor
set protocols bgp neighbor 10.0.0.2 remote-as 65002
set protocols bgp neighbor 10.0.0.2 address-family ipv4-unicast

# Enable BFD for this neighbor
set protocols bgp neighbor 10.0.0.2 bfd

# Or with specific profile
set protocols bgp neighbor 10.0.0.2 bfd profile FAST

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;BFD with OSPF&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Configure OSPF
set protocols ospf area 0 network 10.0.0.0/24

# Enable BFD for all OSPF neighbors (interface level)
set protocols ospf interface eth0 bfd

# Or enable globally for all OSPF interfaces
set protocols ospf parameters bfd all-interfaces

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Multihop BFD&lt;/h2&gt;
&lt;p&gt;For eBGP peers not directly connected:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Multihop BGP neighbor
set protocols bgp neighbor 192.0.2.1 remote-as 65100
set protocols bgp neighbor 192.0.2.1 ebgp-multihop 3

# Multihop BFD (specify source)
set protocols bfd peer 192.0.2.1 source address 198.51.100.1
set protocols bfd peer 192.0.2.1 multihop
set protocols bfd peer 192.0.2.1 profile FAST

# Link BGP to BFD peer
set protocols bgp neighbor 192.0.2.1 bfd

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;BFD Profiles&lt;/h2&gt;
&lt;p&gt;Create profiles for different use cases:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Aggressive (datacenter, low latency links)
set protocols bfd profile AGGRESSIVE interval 100
set protocols bfd profile AGGRESSIVE min-rx 100
set protocols bfd profile AGGRESSIVE multiplier 3
# Detection: 300ms

# Standard (most links)
set protocols bfd profile STANDARD interval 300
set protocols bfd profile STANDARD min-rx 300
set protocols bfd profile STANDARD multiplier 3
# Detection: 900ms

# Conservative (unstable links, prevent flapping)
set protocols bfd profile CONSERVATIVE interval 1000
set protocols bfd profile CONSERVATIVE min-rx 1000
set protocols bfd profile CONSERVATIVE multiplier 5
# Detection: 5 seconds

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Apply Profiles&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# BGP neighbor with specific profile
set protocols bgp neighbor 10.0.0.2 bfd profile AGGRESSIVE

# OSPF interface with specific profile
set protocols ospf interface eth0 bfd profile STANDARD
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Monitoring BFD&lt;/h2&gt;
&lt;h3&gt;View BFD Status&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show all BFD peers
show bfd peers

# Output:
# BFD Peers:
#     peer 10.0.0.2
#         ID: 1234567890
#         Status: up
#         Uptime: 2 hours 15 minutes
#         Diagnostics: ok
#         Local timers:
#             Interval: 300ms
#             Echo interval: disabled
#             Multiplier: 3
#         Remote timers:
#             Interval: 300ms
#             Multiplier: 3
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;View BFD with Routing Protocol&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# BGP neighbor with BFD status
show bgp neighbors 10.0.0.2

# Look for:
# BFD: enabled
# BFD status: Up

# OSPF neighbor with BFD
show ip ospf neighbor

# BFD column shows: Up/Down
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;BFD Counters&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show BFD statistics
show bfd peers counters

# Control packet statistics
# Session state change count
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Echo Mode&lt;/h2&gt;
&lt;p&gt;BFD echo mode reduces CPU load by having the remote peer echo packets back:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Enable echo mode
set protocols bfd peer 10.0.0.2 echo-mode

# Set echo interval
set protocols bfd peer 10.0.0.2 echo-interval 50

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Echo Mode Considerations&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Lower CPU usage (echo packets handled in fast path)&lt;/li&gt;
&lt;li&gt;Requires symmetric forwarding&lt;/li&gt;
&lt;li&gt;May not work across some network devices&lt;/li&gt;
&lt;li&gt;Not available for multihop BFD&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;BFD and High Availability&lt;/h2&gt;
&lt;h3&gt;BFD in Redundant Setup&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;         ISP A
           |
     [10.0.0.2]
           |
    VyOS Router (BFD to both)
           |
     [10.0.1.2]
           |
         ISP B
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;configure

# Primary ISP - aggressive detection
set protocols bgp neighbor 10.0.0.2 remote-as 65001
set protocols bgp neighbor 10.0.0.2 bfd profile AGGRESSIVE

# Backup ISP - also fast detection
set protocols bgp neighbor 10.0.1.2 remote-as 65002
set protocols bgp neighbor 10.0.1.2 bfd profile AGGRESSIVE

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When primary fails, BFD detects in ~300ms, BGP converges, backup takes over.&lt;/p&gt;
&lt;h3&gt;BFD with VRRP&lt;/h3&gt;
&lt;p&gt;BFD can trigger faster VRRP failover:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Not directly integrated, but:
# - BFD detects link failure
# - Track script checks BFD status
# - VRRP priority adjusted based on BFD
&lt;/code&gt;&lt;/pre&gt;
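&lt;p&gt;A minimal sketch of that glue, assuming keepalived provides VRRP and FRR&apos;s &lt;code&gt;vtysh&lt;/code&gt; is available (the script path and peer address are placeholders):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/sh
# /config/scripts/check-bfd.sh - exit 0 if the BFD session to the peer is up
vtysh -c &quot;show bfd peer 10.0.0.2&quot; | grep -q &quot;Status: up&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;keepalived then runs the script periodically and demotes VRRP priority when it fails:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;vrrp_script check_bfd {
    script &quot;/config/scripts/check-bfd.sh&quot;
    interval 2
    weight -50   # lower priority when BFD reports the peer down
}
&lt;/code&gt;&lt;/pre&gt;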
&lt;h2&gt;Troubleshooting BFD&lt;/h2&gt;
&lt;h3&gt;BFD Session Not Establishing&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check if BFD packets are exchanged
sudo tcpdump -i eth0 udp port 3784

# BFD control: UDP port 3784
# BFD echo: UDP port 3785

# Common issues:
# - Firewall blocking BFD ports
# - Source address mismatch
# - Timer mismatch (negotiation fails)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;BFD Flapping&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Session up/down repeatedly
show log | grep -i bfd

# Causes:
# - Timers too aggressive for link quality
# - Congestion causing packet loss
# - MTU issues

# Solution: Increase timers
set protocols bfd profile STABLE interval transmit 500
set protocols bfd profile STABLE interval receive 500
set protocols bfd profile STABLE interval multiplier 5
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;One-Way BFD&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# BFD shows &quot;Down&quot; but packets sent

# Check for asymmetric routing
# BFD packets might take different return path

# For multihop BFD, ensure:
# - Source address configured correctly
# - Routing is symmetric
# - TTL is sufficient
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;BFD Firewall Rules&lt;/h2&gt;
&lt;p&gt;If firewall is enabled, allow BFD:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Allow BFD control packets
set firewall ipv4 name ROUTER-IN rule 20 action accept
set firewall ipv4 name ROUTER-IN rule 20 protocol udp
set firewall ipv4 name ROUTER-IN rule 20 destination port 3784
set firewall ipv4 name ROUTER-IN rule 20 description &quot;BFD Control&quot;

# Allow BFD echo packets
set firewall ipv4 name ROUTER-IN rule 21 action accept
set firewall ipv4 name ROUTER-IN rule 21 protocol udp
set firewall ipv4 name ROUTER-IN rule 21 destination port 3785
set firewall ipv4 name ROUTER-IN rule 21 description &quot;BFD Echo&quot;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Best Practices&lt;/h2&gt;
&lt;h3&gt;1. Match Timers on Both Sides&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Both routers should have compatible timers
# BFD negotiates, but similar values work best

# Router A
set protocols bfd profile STANDARD interval transmit 300
set protocols bfd profile STANDARD interval receive 300
set protocols bfd profile STANDARD interval multiplier 3

# Router B - same settings
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Consider Link Quality&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# High-quality datacenter links
# → Aggressive timers (100-300ms)

# WAN/Internet links
# → Conservative timers (500ms-1s)

# Satellite/high-latency links
# → Very conservative (1s+, higher multiplier)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Don&apos;t Be Too Aggressive&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# 50ms timers sound great until:
# - Minor congestion triggers failover
# - Route flapping destabilizes network
# - CPU can&apos;t keep up with BFD packets

# Start conservative, tune down if needed
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4. Monitor BFD State&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Alert on BFD state changes
# Track BFD flapping frequency
# Correlate with network events
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;BFD Timer Calculation&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;Detection Time = interval × multiplier

Examples:
100ms × 3 = 300ms detection
300ms × 3 = 900ms detection
500ms × 5 = 2.5s detection
1000ms × 3 = 3s detection

Compare to:
BGP default: 180 seconds
OSPF default: 40 seconds
&lt;/code&gt;&lt;/pre&gt;
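&lt;p&gt;The arithmetic is trivial, but a tiny helper keeps timer planning honest (function name is illustrative):&lt;/p&gt;

```shell
# Detection time = interval x multiplier (both sides assumed symmetric)
bfd_detect_ms() {
    local interval_ms=$1 multiplier=$2
    echo $(( interval_ms * multiplier ))
}

bfd_detect_ms 300 3     # → 900  (standard profile)
bfd_detect_ms 1000 5    # → 5000 (conservative WAN profile)
```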
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Routing protocol keepalives are too slow. BFD fixes this.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Without BFD:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;BGP: 180 seconds to detect dead peer&lt;/li&gt;
&lt;li&gt;OSPF: 40 seconds to detect dead neighbor&lt;/li&gt;
&lt;li&gt;Traffic blackholed during detection&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With BFD:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Detection in sub-second (300ms-1s typical)&lt;/li&gt;
&lt;li&gt;Routing protocols react immediately&lt;/li&gt;
&lt;li&gt;Failover happens before users notice&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;BFD is simple to configure, low overhead, and dramatically improves convergence time. Every production BGP session and OSPF adjacency should have BFD enabled.&lt;/p&gt;
&lt;p&gt;The only question is timer values: aggressive for reliable links, conservative for flaky links. Start with 300ms/3, adjust based on your network.&lt;/p&gt;
</content:encoded><category>vyos</category><category>bgp</category><category>ha</category><category>networking</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Policy Routing Debug: Why Traffic Takes the Wrong Path</title><link>https://ashimov.com/posts/vyos-pbr-debug/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-pbr-debug/</guid><description>Debug policy-based routing on VyOS. Covers rule evaluation order, mark verification, table inspection, common misconfigurations, and why PBR debugging needs systematic verification.</description><pubDate>Tue, 30 Dec 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Policy routing configured. Traffic still takes the default route. You add more rules. Still doesn&apos;t work. You start guessing.&lt;/p&gt;
&lt;p&gt;Policy-based routing (PBR) is simple in concept but has multiple points of failure. Each must be correct: match criteria, firewall marks, routing tables, rule priority. Miss one, and traffic ignores your policy.&lt;/p&gt;
&lt;p&gt;PBR debugging needs systematic verification, not guessing.&lt;/p&gt;
&lt;h2&gt;PBR Components&lt;/h2&gt;
&lt;p&gt;Policy routing has four parts. All must be correct:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;1. Policy: Match traffic and set mark
   └── firewall rules with mark action

2. Mark: Identify traffic for routing decision
   └── fwmark value (0x1, 0x2, etc.)

3. Table: Alternative routing table
   └── custom routes separate from main

4. Rule: Match mark and use table
   └── ip rule connecting mark to table
&lt;/code&gt;&lt;/pre&gt;
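&lt;p&gt;VyOS generates the underlying Linux plumbing for you; a rough sketch of the same four parts as raw &lt;code&gt;iptables&lt;/code&gt;/&lt;code&gt;iproute2&lt;/code&gt; commands (values illustrative, not what VyOS emits verbatim) makes the chain visible:&lt;/p&gt;

```shell
# 1+2. Policy marks matching traffic (mangle PREROUTING)
iptables -t mangle -A PREROUTING -i eth1 -d 10.0.0.0/8 -j MARK --set-mark 0x1

# 3. Alternative routing table with its own default route
ip route add default via 10.10.0.1 dev tun0 table 10

# 4. Rule sends marked packets to that table
ip rule add fwmark 0x1 lookup 10
```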
&lt;h2&gt;Verification Workflow&lt;/h2&gt;
&lt;h3&gt;Step 1: Verify Policy Matches&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check policy is applied to interface
show configuration commands | grep policy

# Expected:
# set policy route VPN-TRAFFIC rule 10 set mark 0x1
# set interfaces ethernet eth1 policy route VPN-TRAFFIC

# Verify policy rules
show configuration commands | grep &quot;policy route&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 2: Verify Traffic Gets Marked&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check firewall counters (if logging enabled)
show firewall

# Check if marking is happening with iptables
sudo iptables -t mangle -L -v -n

# Output should show packet counts on MARK rules:
# pkts bytes target     prot  in     out  source   destination
# 1234 5678K MARK       all   eth1   *    0.0.0.0/0  10.0.0.0/24   MARK set 0x1
#                                                                      ↑ Packets matched

# If counter is zero → policy not matching traffic
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 3: Verify Routing Table Exists&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show custom routing table
show ip route table 10

# Or directly:
ip route show table 10

# Should show routes:
# default via 10.10.0.1 dev tun0
# 10.10.0.0/24 dev tun0 proto kernel scope link src 10.10.0.2
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 4: Verify Rule Connects Mark to Table&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show all rules
ip rule show

# Expected output:
# 0:      from all lookup local
# 32765:  from all fwmark 0x1 lookup 10    ← Your rule
# 32766:  from all lookup main
# 32767:  from all lookup default

# If your fwmark rule is missing → rule not created
# If rule priority is wrong → might be evaluated after main table
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 5: Test End-to-End&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Simulate marked packet lookup
ip route get 8.8.8.8 mark 0x1

# Expected:
# 8.8.8.8 via 10.10.0.1 dev tun0 table 10 mark 0x1

# If it shows main table route → mark not working
&lt;/code&gt;&lt;/pre&gt;
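&lt;p&gt;The end-to-end check can be scripted; this helper (name is illustrative) parses &lt;code&gt;ip route get&lt;/code&gt; output from stdin, so it is testable without live routes:&lt;/p&gt;

```shell
# Succeed if the route lookup landed in the given table
pbr_uses_table() {
    grep -q "table $1"
}

# Usage:
#   if ip route get 8.8.8.8 mark 0x1 | pbr_uses_table 10; then
#       echo "policy table used"
#   else
#       echo "main table used"
#   fi
```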
&lt;h2&gt;Common Problems&lt;/h2&gt;
&lt;h3&gt;Problem 1: Policy Not Applied to Interface&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Symptom: Traffic not marked

# Check interface has policy
show interfaces ethernet eth1

# Should show:
# policy {
#     route VPN-TRAFFIC
# }

# If missing:
configure
set interfaces ethernet eth1 policy route VPN-TRAFFIC
commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Problem 2: Wrong Match Criteria&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Symptom: Policy exists but doesn&apos;t match traffic

# Show policy details
show configuration commands | grep &quot;policy route VPN-TRAFFIC&quot;

# Common mistakes:
# - Source instead of destination (or vice versa)
# - Wrong subnet mask
# - Wrong protocol/port
# - Rule disabled

# Test what traffic should match:
# Rule says: source 192.168.1.0/24, destination 10.0.0.0/8
# Traffic is: source 192.168.2.100 → Won&apos;t match!
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Problem 3: Mark Not Set&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Symptom: Rule matches but no mark

# Check iptables for mark rules
sudo iptables -t mangle -L PREROUTING -v -n

# Look for MARK target
# If MARK target shows 0 packets → not matching
# If no MARK rule → policy not generating iptables rules

# Verify mark is in policy:
show configuration commands | grep &quot;set mark&quot;
# set policy route VPN-TRAFFIC rule 10 set mark 0x1
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Problem 4: Table Missing or Empty&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Symptom: Marked traffic uses main table

# Check table exists
ip route show table 10

# If empty or missing:
configure

# Add the table (VyOS creates it automatically when you add a static route to it)
set protocols static table 10 route 0.0.0.0/0 next-hop 10.10.0.1

# Or for interface-based:
set protocols static table 10 route 0.0.0.0/0 interface tun0

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Problem 5: Rule Priority Wrong&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Symptom: Table has routes but not used

ip rule show
# 32765:  from all lookup main
# 32766:  from all fwmark 0x1 lookup 10    ← Too late!
# 32767:  from all lookup default

# Main table is checked before fwmark rule
# Traffic matches in main, never reaches your rule

# VyOS should set correct priority, but verify
# Lower number = higher priority
# fwmark rules should be before main (32766)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Problem 6: Return Traffic Not Marked&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Symptom: Outbound works, return traffic takes wrong path

# PBR typically marks only initiating direction
# Return traffic must be handled by:
# - Conntrack (automatic if stateful)
# - Separate marking rule for return

# Check if conntrack is preserving marks
sudo conntrack -L | grep mark

# Enable connection mark restore:
# Usually automatic with VyOS, but can verify in iptables
sudo iptables -t mangle -L -v
&lt;/code&gt;&lt;/pre&gt;
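&lt;p&gt;For reference, the usual conntrack pattern is CONNMARK save/restore. VyOS normally installs equivalent rules itself, so this plain-iptables sketch only shows what to look for:&lt;/p&gt;

```shell
# Restore any saved connection mark early, so reply packets inherit
# the mark their connection was classified with
iptables -t mangle -A PREROUTING -j CONNMARK --restore-mark

# ... policy MARK rules evaluate here ...

# Save the packet mark back onto the connection for future packets
iptables -t mangle -A POSTROUTING -j CONNMARK --save-mark
```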
&lt;h2&gt;Debugging Commands&lt;/h2&gt;
&lt;h3&gt;Check What Route Traffic Would Take&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Without mark (normal routing)
ip route get 8.8.8.8

# With mark (policy routing)
ip route get 8.8.8.8 mark 0x1

# Compare outputs to see if PBR is working
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Check Packet Counts&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# How many packets matched policy?
sudo iptables -t mangle -L PREROUTING -v -n | grep MARK

# Reset counters and test
sudo iptables -t mangle -Z
# Generate test traffic
curl http://10.0.0.100/
# Check counters again
sudo iptables -t mangle -L PREROUTING -v -n | grep MARK
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Trace Packet Path&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Enable netfilter trace (temporary debug)
sudo modprobe nf_log_ipv4
sudo sysctl -w net.netfilter.nf_log.2=nf_log_ipv4

# Add trace rule for specific traffic
sudo iptables -t raw -A PREROUTING -s 192.168.1.100 -j TRACE

# Watch kernel log
dmesg -w

# Remove trace when done
sudo iptables -t raw -D PREROUTING -s 192.168.1.100 -j TRACE
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Check Firewall Flow&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# See where packet is in firewall processing
sudo iptables -t mangle -L -v -n  # Marking happens here
sudo iptables -t nat -L -v -n      # NAT happens here
sudo iptables -t filter -L -v -n   # Filtering happens here

# PBR marks in mangle PREROUTING
# Routing decision happens after mangle
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;VyOS PBR Configuration Reference&lt;/h2&gt;
&lt;h3&gt;Complete Working Example&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# 1. Create routing table with routes
set protocols static table 10 route 0.0.0.0/0 next-hop 10.10.0.1

# 2. Create policy with mark
set policy route PBR-TO-VPN rule 10 destination address 10.0.0.0/8
set policy route PBR-TO-VPN rule 10 set mark 0x1
set policy route PBR-TO-VPN rule 10 set table 10

# 3. Apply policy to interface
set interfaces ethernet eth1 policy route PBR-TO-VPN

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Verify Each Component&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# 1. Table has routes
ip route show table 10
# → Should show default via 10.10.0.1

# 2. Policy creates iptables rules
sudo iptables -t mangle -L PREROUTING -v -n | grep -i mark
# → Should show MARK rule

# 3. IP rule connects mark to table
ip rule show | grep fwmark
# → Should show: fwmark 0x1 lookup 10

# 4. Test packet routing
ip route get 10.0.0.100 mark 0x1
# → Should show: via 10.10.0.1 table 10
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Advanced Debugging&lt;/h2&gt;
&lt;h3&gt;Multiple Tables&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# If using multiple tables:
ip route show table 10
ip route show table 20

# Verify rules don&apos;t conflict:
ip rule show

# Each mark should have unique table
# 32765:  fwmark 0x1 lookup 10
# 32764:  fwmark 0x2 lookup 20
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Source-Based Routing&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# If routing by source:
set policy route BY-SOURCE rule 10 source address 192.168.1.0/24
set policy route BY-SOURCE rule 10 set table 10

# Verify source matches
sudo iptables -t mangle -L PREROUTING -v -n
# Should show source match
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;DSCP/TOS Marking&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# If matching on DSCP:
set policy route QOS-ROUTING rule 10 dscp 46
set policy route QOS-ROUTING rule 10 set table 10

# Verify packet has expected DSCP
sudo tcpdump -i eth1 -v | grep &quot;tos&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Testing Strategy&lt;/h2&gt;
&lt;h3&gt;Minimal Test&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# 1. Create simple policy
set policy route TEST rule 10 destination address 8.8.8.8/32
set policy route TEST rule 10 set table 10
set protocols static table 10 route 0.0.0.0/0 blackhole

# 2. Apply to interface
set interfaces ethernet eth1 policy route TEST

# 3. Test
ping 8.8.8.8  # Should fail (blackhole)
ping 8.8.4.4  # Should work (not matched)

# 4. Clean up
delete policy route TEST
delete interfaces ethernet eth1 policy route
delete protocols static table 10
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Incremental Testing&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Test each component in order:

# Test 1: Does table work?
ip route add blackhole 8.8.8.8 table 10
ip route get 8.8.8.8  # Uses main table → should work
# Clean: ip route del blackhole 8.8.8.8 table 10

# Test 2: Does rule work?
ip rule add fwmark 0x99 table 10
ip route add blackhole 8.8.8.8 table 10
ip route get 8.8.8.8 mark 0x99  # Should show table 10, blackhole
# Clean: ip rule del fwmark 0x99 table 10; ip route del blackhole 8.8.8.8 table 10

# Test 3: Does policy create mark?
# Apply policy, check iptables counters
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;PBR debugging needs systematic verification, not guessing.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When policy routing doesn&apos;t work:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Verify policy applied&lt;/strong&gt; to correct interface&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Verify traffic matches&lt;/strong&gt; policy rules&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Verify mark is set&lt;/strong&gt; (check iptables counters)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Verify table exists&lt;/strong&gt; with correct routes&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Verify rule connects&lt;/strong&gt; mark to table&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Test with &lt;code&gt;ip route get ... mark&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Each step depends on the previous. One failure breaks everything after it.&lt;/p&gt;
&lt;p&gt;Don&apos;t add more rules hoping it helps. Verify each component. Find the broken step. Fix that one thing.&lt;/p&gt;
&lt;p&gt;PBR is a chain. Find the broken link.&lt;/p&gt;
</content:encoded><category>vyos</category><category>networking</category><category>routing</category><category>troubleshooting</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>ARP and Neighbor Discovery: Troubleshooting Layer 2 Problems</title><link>https://ashimov.com/posts/vyos-arp-nd/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-arp-nd/</guid><description>Debug ARP and IPv6 ND issues on VyOS. Covers ARP table analysis, stale entries, duplicate IP detection, proxy ARP, neighbor discovery, and why Layer 2 problems look like Layer 3 failures.</description><pubDate>Fri, 26 Dec 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Routing is correct. Firewall allows traffic. Ping fails. You spend an hour checking Layer 3 and 4. The problem is Layer 2.&lt;/p&gt;
&lt;p&gt;ARP (IPv4) and Neighbor Discovery (IPv6) map IP addresses to MAC addresses. When this mapping fails, packets can&apos;t be delivered — even though routing looks perfect.&lt;/p&gt;
&lt;p&gt;Layer 2 problems look like Layer 3 failures. Always check ARP.&lt;/p&gt;
&lt;h2&gt;Understanding ARP&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;Host A wants to send packet to 192.168.1.100:

1. Check ARP cache: &quot;Do I know the MAC for 192.168.1.100?&quot;
2. If no: Send ARP request (broadcast)
   &quot;Who has 192.168.1.100? Tell 192.168.1.1&quot;
3. 192.168.1.100 replies (unicast)
   &quot;192.168.1.100 is at aa:bb:cc:dd:ee:ff&quot;
4. Cache entry created, packet sent
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If ARP fails, the IP packet can&apos;t be sent. It looks like a routing problem, but it&apos;s really MAC resolution.&lt;/p&gt;
&lt;h2&gt;Viewing ARP Table&lt;/h2&gt;
&lt;h3&gt;VyOS Commands&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show ARP table
show arp

# Output:
# IP Address      HW Address         Flags  Interface
# 192.168.1.100   aa:bb:cc:dd:ee:ff  C      eth1
# 192.168.1.101   bb:cc:dd:ee:ff:00  C      eth1
# 192.168.1.1     (incomplete)              eth1    ← Problem!

# C = Complete (resolved)
# (incomplete) = ARP request sent, no reply
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Detailed ARP Info&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Using ip command
ip neigh show

# Output:
# 192.168.1.100 dev eth1 lladdr aa:bb:cc:dd:ee:ff REACHABLE
# 192.168.1.101 dev eth1 lladdr bb:cc:dd:ee:ff:00 STALE
# 192.168.1.102 dev eth1  FAILED

# States:
# REACHABLE - Recently verified
# STALE - Not verified recently
# DELAY - Verification pending
# PROBE - Actively verifying
# FAILED - ARP resolution failed
# PERMANENT - Static entry
&lt;/code&gt;&lt;/pre&gt;
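&lt;p&gt;A quick way to spot trouble is summarizing entries by state; the helper (name is illustrative) reads stdin, so it also works on saved output:&lt;/p&gt;

```shell
# Count neighbor entries per state; feed it `ip neigh show` output
neigh_state_summary() {
    awk '{print $NF}' | sort | uniq -c | sort -rn
}

# Usage: ip -4 neigh show | neigh_state_summary
# A sudden burst of STALE/FAILED entries points at a Layer 2 problem.
```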
&lt;h3&gt;Filter by Interface&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# ARP entries for specific interface
ip neigh show dev eth1

# ARP entries for specific IP
ip neigh show 192.168.1.100
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;ARP Problems and Solutions&lt;/h2&gt;
&lt;h3&gt;Problem 1: Incomplete ARP Entry&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Symptom:
show arp
# 192.168.1.100   (incomplete)        eth1

# Causes:
# - Target host is down
# - Target host has wrong IP
# - Target host on different VLAN
# - Network issue between hosts

# Debug:
# 1. Capture ARP traffic
sudo tcpdump -i eth1 arp

# 2. See if requests go out, responses come back
# 08:30:01 ARP, Request who-has 192.168.1.100 tell 192.168.1.1
# (no reply = host unreachable at Layer 2)

# 3. Verify VLAN tagging
show interfaces ethernet eth1
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Problem 2: Wrong MAC Address&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Symptom: Traffic goes to wrong host

# Check ARP for expected IP
show arp | grep 192.168.1.100

# If MAC doesn&apos;t match expected host:
# - Duplicate IP (two hosts same IP)
# - IP moved to different host
# - ARP spoofing attack

# Clear entry and let it re-resolve
ip neigh del 192.168.1.100 dev eth1
ping 192.168.1.100
show arp | grep 192.168.1.100
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Problem 3: Stale ARP Entries&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Symptom: Intermittent connectivity after IP change

# Old MAC cached, traffic goes to wrong place
ip neigh show
# 192.168.1.100 dev eth1 lladdr aa:bb:cc:dd:ee:ff STALE

# Flush stale entry
ip neigh flush 192.168.1.100

# Or flush all on interface
ip neigh flush dev eth1
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Problem 4: ARP Table Full&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Symptom: New hosts can&apos;t connect

# Check table size
cat /proc/sys/net/ipv4/neigh/default/gc_thresh3
# Default: 1024

# If many hosts, increase:
configure
set system sysctl parameter net.ipv4.neigh.default.gc_thresh3 value 4096
commit

# Or via sysctl directly (not persistent across reboots):
sudo sysctl -w net.ipv4.neigh.default.gc_thresh3=4096
&lt;/code&gt;&lt;/pre&gt;
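&lt;p&gt;The three &lt;code&gt;gc_thresh&lt;/code&gt; values work together (defaults 128/512/1024: soft floor, soft ceiling, hard limit), so scale them together rather than raising only the hard limit. The sizes below are an assumption — pick roughly 2× your host count:&lt;/p&gt;

```shell
configure

# Illustrative values - size for roughly 2x your host count
set system sysctl parameter net.ipv4.neigh.default.gc_thresh1 value 1024
set system sysctl parameter net.ipv4.neigh.default.gc_thresh2 value 2048
set system sysctl parameter net.ipv4.neigh.default.gc_thresh3 value 4096

commit
```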
&lt;h2&gt;Static ARP Entries&lt;/h2&gt;
&lt;p&gt;For critical hosts, use static ARP to prevent spoofing:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Add static ARP entry
set protocols static arp interface eth1 address 192.168.1.100 mac aa:bb:cc:dd:ee:ff

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;When to Use Static ARP&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Critical servers (DNS, gateway)&lt;/li&gt;
&lt;li&gt;Security-sensitive hosts&lt;/li&gt;
&lt;li&gt;Environments with ARP spoofing risk&lt;/li&gt;
&lt;li&gt;Fixed infrastructure (won&apos;t change MAC)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Proxy ARP&lt;/h2&gt;
&lt;p&gt;Router answers ARP on behalf of other networks:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check if proxy ARP is enabled
cat /proc/sys/net/ipv4/conf/eth1/proxy_arp

# Enable proxy ARP on interface
configure
set interfaces ethernet eth1 ip enable-proxy-arp
commit

# Use case: When hosts on different subnets share same VLAN
# Router answers ARP for remote subnet, forwards traffic
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Proxy ARP Risks&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Breaks subnet boundaries&lt;/li&gt;
&lt;li&gt;Can cause routing confusion&lt;/li&gt;
&lt;li&gt;Security implications (answers for others)&lt;/li&gt;
&lt;li&gt;Usually sign of network design problem&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;IPv6 Neighbor Discovery&lt;/h2&gt;
&lt;p&gt;IPv6 uses ICMPv6 Neighbor Discovery instead of ARP:&lt;/p&gt;
&lt;h3&gt;View Neighbor Table&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show IPv6 neighbors
ip -6 neigh show

# Output:
# fe80::1 dev eth1 lladdr aa:bb:cc:dd:ee:ff REACHABLE
# 2001:db8::100 dev eth1 lladdr bb:cc:dd:ee:ff:00 STALE
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Neighbor Discovery Types&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;NDP Message Types:
- Neighbor Solicitation (NS): &quot;Who has this IPv6?&quot;
- Neighbor Advertisement (NA): &quot;I have this IPv6&quot;
- Router Solicitation (RS): &quot;Are there any routers?&quot;
- Router Advertisement (RA): &quot;I&apos;m a router, here&apos;s the prefix&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Debug ND Issues&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Capture NDP traffic
sudo tcpdump -i eth1 icmp6

# Filter for specific types
sudo tcpdump -i eth1 &apos;icmp6 and ip6[40] == 135&apos;  # Neighbor Solicitation
sudo tcpdump -i eth1 &apos;icmp6 and ip6[40] == 136&apos;  # Neighbor Advertisement
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;IPv6 ND Problems&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Problem: Duplicate Address Detection fails
# Host won&apos;t configure IPv6 address

# Check for duplicate:
sudo tcpdump -i eth1 &apos;icmp6 and ip6[40] == 136&apos;

# Problem: No router advertisements
# Hosts can&apos;t find gateway

# Check RA on interface:
sudo tcpdump -i eth1 &apos;icmp6 and ip6[40] == 134&apos;

# VyOS sends RA if configured:
set service router-advert interface eth1 prefix 2001:db8::/64
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Duplicate IP Detection&lt;/h2&gt;
&lt;h3&gt;Detecting Duplicates&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check for multiple MACs responding to same IP
arping -D -I eth1 192.168.1.100

# If duplicate exists, arping gets responses from multiple MACs

# Or capture ARP and look for different MAC sources
sudo tcpdump -i eth1 arp and host 192.168.1.100
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Gratuitous ARP&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Send gratuitous ARP (announce IP)
arping -A -I eth1 192.168.1.1 -c 1

# Use after IP address change or failover
# Updates ARP caches network-wide
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Common Scenarios&lt;/h2&gt;
&lt;h3&gt;Scenario 1: New Server Not Reachable&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Server configured, can&apos;t reach from router
ping 192.168.1.100
# PING 192.168.1.100: 56 data bytes
# (no response)

show arp | grep 192.168.1.100
# 192.168.1.100   (incomplete)

# ARP not resolving:
# - Server on wrong VLAN?
# - Server IP configured wrong?
# - Server interface down?

# From server side:
# ip addr show
# Check if IP is on correct interface
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Scenario 2: Traffic Goes to Wrong Host&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Application connecting to wrong server
show arp | grep 10.0.0.50
# 10.0.0.50   aa:bb:cc:dd:ee:ff   C   eth1

# But expected MAC was bb:cc:dd:ee:ff:00
# Duplicate IP! Two hosts have 10.0.0.50

# Solution:
# 1. Find both hosts
# 2. Remove duplicate IP from wrong host
# 3. Flush ARP
ip neigh flush 10.0.0.50
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Scenario 3: Connectivity Works Then Fails&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Works initially, fails after some time

# Check ARP timeout
cat /proc/sys/net/ipv4/neigh/default/base_reachable_time_ms
# 30000 (30 seconds)

# Entry goes STALE, then needs refresh
# If refresh fails → connectivity lost

# Debug:
watch -n 1 &apos;ip neigh show 192.168.1.100&apos;
# Watch state transition
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Scenario 4: After Failover, Old IP Unreachable&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Failover happened, but clients still sending to old MAC

# Need gratuitous ARP from new server:
arping -A -I eth1 192.168.1.100 -c 3

# Or clear ARP cache on clients/routers:
ip neigh flush 192.168.1.100
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Monitoring ARP&lt;/h2&gt;
&lt;h3&gt;Watch ARP Table&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Continuous monitoring
watch -n 2 &apos;ip neigh show dev eth1&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Log ARP Changes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Linux doesn&apos;t log ARP by default
# Use arpwatch for monitoring:

apt install arpwatch
arpwatch -i eth1 -f /var/lib/arpwatch/eth1.dat

# Logs to syslog:
# new station 192.168.1.100 aa:bb:cc:dd:ee:ff
# changed ethernet address 192.168.1.100 old:mac new:mac
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Best Practices&lt;/h2&gt;
&lt;h3&gt;1. Static ARP for Critical Infrastructure&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Gateway, DNS, critical servers
set protocols static arp interface eth1 address 192.168.1.1 mac aa:bb:cc:dd:ee:ff
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Monitor for Duplicates&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Sweep the subnet for in-use addresses
# (iputils arping -D exits non-zero when a reply is received)
for ip in $(seq 1 254); do
    arping -D -q -c 1 -I eth1 192.168.1.$ip || echo &quot;192.168.1.$ip answered&quot;
done
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Clear ARP During Troubleshooting&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# When changing IPs or after failover
ip neigh flush dev eth1
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4. Check ARP First&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Before deep Layer 3 debugging
show arp | grep &amp;lt;problem-ip&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Layer 2 problems look like Layer 3 failures. Always check ARP.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When ping fails:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Is there an ARP entry?&lt;/li&gt;
&lt;li&gt;Is it complete or incomplete?&lt;/li&gt;
&lt;li&gt;Is the MAC address correct?&lt;/li&gt;
&lt;li&gt;Is the entry REACHABLE or STALE/FAILED?&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Layer 2 issues cause:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Intermittent connectivity (stale entries)&lt;/li&gt;
&lt;li&gt;Wrong destination (wrong MAC)&lt;/li&gt;
&lt;li&gt;Complete failure (no entry)&lt;/li&gt;
&lt;li&gt;Slow performance (ARP delays)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;ARP is simple but foundational. When it breaks, nothing above it works. Check it first, not last.&lt;/p&gt;
</content:encoded><category>vyos</category><category>networking</category><category>troubleshooting</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Packet Capture on VyOS: tcpdump Techniques for Real Debugging</title><link>https://ashimov.com/posts/vyos-packet-capture/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-packet-capture/</guid><description>Master packet capture on VyOS for troubleshooting. Covers tcpdump filters, capture strategies, decoding protocols, saving and analyzing captures, and why packets never lie.</description><pubDate>Tue, 23 Dec 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Logs say everything is fine. Routing table looks correct. Firewall rules seem right. But traffic still doesn&apos;t flow. What&apos;s actually happening?&lt;/p&gt;
&lt;p&gt;Packet capture shows you the truth. The actual packets. Not what the logs say happened, not what should happen according to config — what actually happens on the wire.&lt;/p&gt;
&lt;p&gt;Packets never lie.&lt;/p&gt;
&lt;h2&gt;Basic Packet Capture&lt;/h2&gt;
&lt;h3&gt;VyOS Monitor Traffic Command&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# VyOS provides a wrapper around tcpdump
monitor traffic interface eth0

# Stop with Ctrl+C
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Direct tcpdump&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# More control with tcpdump directly
sudo tcpdump -i eth0

# Common options
sudo tcpdump -i eth0 -n          # Don&apos;t resolve names (faster)
sudo tcpdump -i eth0 -v          # Verbose
sudo tcpdump -i eth0 -vv         # More verbose
sudo tcpdump -i eth0 -c 100      # Capture 100 packets then stop
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Capture Filters&lt;/h2&gt;
&lt;h3&gt;By IP Address&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Traffic from specific source
sudo tcpdump -i eth0 -n src 192.168.1.100

# Traffic to specific destination
sudo tcpdump -i eth0 -n dst 8.8.8.8

# Traffic to or from host
sudo tcpdump -i eth0 -n host 192.168.1.100

# Traffic between two hosts
sudo tcpdump -i eth0 -n host 192.168.1.100 and host 8.8.8.8
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;By Network&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Traffic to/from subnet
sudo tcpdump -i eth0 -n net 192.168.1.0/24

# Traffic NOT from subnet
sudo tcpdump -i eth0 -n not net 192.168.1.0/24
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;By Protocol&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# ICMP only
sudo tcpdump -i eth0 -n icmp

# TCP only
sudo tcpdump -i eth0 -n tcp

# UDP only
sudo tcpdump -i eth0 -n udp

# ARP
sudo tcpdump -i eth0 -n arp

# OSPF
sudo tcpdump -i eth0 -n proto ospf

# BGP (TCP 179)
sudo tcpdump -i eth0 -n tcp port 179
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;By Port&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Specific port
sudo tcpdump -i eth0 -n port 80
sudo tcpdump -i eth0 -n port 443

# Source port
sudo tcpdump -i eth0 -n src port 22

# Destination port
sudo tcpdump -i eth0 -n dst port 80

# Port range
sudo tcpdump -i eth0 -n portrange 1-1024
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Combined Filters&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# HTTP traffic from specific host
sudo tcpdump -i eth0 -n host 192.168.1.100 and port 80

# SSH excluding specific host
sudo tcpdump -i eth0 -n port 22 and not host 192.168.1.1

# All traffic except SSH (while connected via SSH)
sudo tcpdump -i eth0 -n not port 22

# TCP SYN packets (new connections)
sudo tcpdump -i eth0 -n &apos;tcp[tcpflags] &amp;amp; (tcp-syn) != 0&apos;

# TCP RST packets (connection resets)
sudo tcpdump -i eth0 -n &apos;tcp[tcpflags] &amp;amp; (tcp-rst) != 0&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Saving Captures&lt;/h2&gt;
&lt;h3&gt;Write to File&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Save to pcap file
sudo tcpdump -i eth0 -n -w /tmp/capture.pcap

# Save with rotation (10 files of 100MB each)
sudo tcpdump -i eth0 -n -w /tmp/capture.pcap -C 100 -W 10

# Save with timestamp in filename
sudo tcpdump -i eth0 -n -w /tmp/capture-$(date +%Y%m%d-%H%M%S).pcap
&lt;/code&gt;&lt;/pre&gt;
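&lt;p&gt;Time-based rotation is also available: &lt;code&gt;-G&lt;/code&gt; starts a new file every N seconds, and the &lt;code&gt;-w&lt;/code&gt; filename then accepts &lt;code&gt;strftime&lt;/code&gt; patterns:&lt;/p&gt;

```shell
# New capture file every hour, named by timestamp
sudo tcpdump -i eth0 -n -G 3600 -w '/tmp/capture-%Y%m%d-%H%M%S.pcap'

# With -W, stop after that many rotated files (here: one day)
sudo tcpdump -i eth0 -n -G 3600 -W 24 -w '/tmp/capture-%Y%m%d-%H%M%S.pcap'
```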
&lt;h3&gt;Read from File&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Read saved capture
sudo tcpdump -r /tmp/capture.pcap

# Read with filter
sudo tcpdump -r /tmp/capture.pcap tcp port 80

# Transfer to workstation for Wireshark analysis
scp admin@router:/tmp/capture.pcap .
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Capture Strategies&lt;/h2&gt;
&lt;h3&gt;Strategy 1: Two-Point Capture&lt;/h3&gt;
&lt;p&gt;Capture at both ends to see what&apos;s happening:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On router ingress
sudo tcpdump -i eth0 -n -w /tmp/ingress.pcap host 192.168.1.100

# On router egress
sudo tcpdump -i eth1 -n -w /tmp/egress.pcap host 192.168.1.100

# Compare: Did packet arrive? Did it leave?
&lt;/code&gt;&lt;/pre&gt;
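&lt;p&gt;To line the two files up, count packets on each side, or merge them into a single chronological file with &lt;code&gt;mergecap&lt;/code&gt; (part of the Wireshark package, assumed installed here):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Count packets seen on each side
tcpdump -r /tmp/ingress.pcap -n | wc -l
tcpdump -r /tmp/egress.pcap -n | wc -l

# Merge both captures into one time-sorted file
mergecap -w /tmp/combined.pcap /tmp/ingress.pcap /tmp/egress.pcap
&lt;/code&gt;&lt;/pre&gt;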
&lt;h3&gt;Strategy 2: All-Interface Capture&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Capture on all interfaces
sudo tcpdump -i any -n host 192.168.1.100

# Shows which interface packets traverse
# Note: May see packet twice (in and out)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Strategy 3: Before/After NAT&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Inside interface (pre-NAT source IP)
sudo tcpdump -i eth1 -n src 192.168.1.100

# Outside interface (post-NAT source IP)
sudo tcpdump -i eth0 -n src &amp;lt;public-ip&amp;gt;

# Verify NAT translation is happening
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Strategy 4: Firewall Debug&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Capture traffic that should be allowed
sudo tcpdump -i eth0 -n dst port 443 and dst 192.168.1.100

# If packets arrive but connection fails:
# - Firewall blocking
# - No return route
# - Server not listening
&lt;/code&gt;&lt;/pre&gt;
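&lt;p&gt;To separate those three causes, a quick sketch using standard tools (&lt;code&gt;ss&lt;/code&gt; on the server, &lt;code&gt;conntrack&lt;/code&gt; on the router if installed):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On the server: is anything listening on 443?
ss -tln | grep :443

# On the router: did a conntrack entry get created?
sudo conntrack -L -d 192.168.1.100 | grep 443

# Entry stuck in SYN_SENT = no reply coming back from the server side
&lt;/code&gt;&lt;/pre&gt;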
&lt;h2&gt;Reading tcpdump Output&lt;/h2&gt;
&lt;h3&gt;TCP Three-Way Handshake&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Normal connection:
10:15:01 IP 192.168.1.100.54321 &amp;gt; 8.8.8.8.80: Flags [S], seq 1000
10:15:01 IP 8.8.8.8.80 &amp;gt; 192.168.1.100.54321: Flags [S.], seq 2000, ack 1001
10:15:01 IP 192.168.1.100.54321 &amp;gt; 8.8.8.8.80: Flags [.], ack 2001

# [S] = SYN
# [S.] = SYN-ACK
# [.] = ACK
# [P.] = PUSH-ACK (data)
# [F.] = FIN-ACK
# [R.] = RST-ACK
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Connection Refused&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# RST immediately after SYN
10:15:01 IP 192.168.1.100.54321 &amp;gt; 8.8.8.8.80: Flags [S], seq 1000
10:15:01 IP 8.8.8.8.80 &amp;gt; 192.168.1.100.54321: Flags [R.], ack 1001

# Means: port closed, or a firewall/IPS actively rejecting the connection
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Connection Timeout&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# SYN retransmits, no response
10:15:01 IP 192.168.1.100.54321 &amp;gt; 8.8.8.8.80: Flags [S], seq 1000
10:15:02 IP 192.168.1.100.54321 &amp;gt; 8.8.8.8.80: Flags [S], seq 1000
10:15:04 IP 192.168.1.100.54321 &amp;gt; 8.8.8.8.80: Flags [S], seq 1000

# Means: Packets not reaching destination or response not returning
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Protocol-Specific Captures&lt;/h2&gt;
&lt;h3&gt;DNS Debugging&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Capture DNS queries and responses
sudo tcpdump -i eth0 -n port 53

# Verbose to see query details
sudo tcpdump -i eth0 -n -v port 53

# Example output:
# 192.168.1.100.12345 &amp;gt; 8.8.8.8.53: 12345+ A? example.com.
# 8.8.8.8.53 &amp;gt; 192.168.1.100.12345: 12345 1/0/0 A 93.184.216.34
&lt;/code&gt;&lt;/pre&gt;
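&lt;p&gt;To generate traffic on demand, run the capture in one terminal and trigger a single query from another (&lt;code&gt;dig&lt;/code&gt; assumed available):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Terminal 1: stop after query + response
sudo tcpdump -i eth0 -n -c 2 port 53

# Terminal 2: trigger one lookup
dig @8.8.8.8 example.com +short
&lt;/code&gt;&lt;/pre&gt;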
&lt;h3&gt;HTTP Debugging&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Capture HTTP traffic
sudo tcpdump -i eth0 -n -A port 80

# -A shows ASCII content (HTTP headers)
# WARNING: May capture sensitive data
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;BGP Session&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Capture BGP traffic
sudo tcpdump -i eth0 -n tcp port 179

# See BGP OPEN, UPDATE, KEEPALIVE messages
sudo tcpdump -i eth0 -n -v tcp port 179
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;OSPF&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Capture OSPF traffic (IP protocol 89)
sudo tcpdump -i eth0 -n ip proto ospf

# See Hello, LSA, DBD packets
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;IPsec&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Capture IKE negotiation (UDP 500/4500)
sudo tcpdump -i eth0 -n udp port 500 or udp port 4500

# Capture ESP packets
sudo tcpdump -i eth0 -n proto esp
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Common Troubleshooting Scenarios&lt;/h2&gt;
&lt;h3&gt;Scenario 1: Traffic Not Reaching Destination&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Step 1: Capture at source
sudo tcpdump -i eth1 -n src 192.168.1.100 and dst 8.8.8.8

# Step 2: Capture at exit interface
sudo tcpdump -i eth0 -n src 192.168.1.100 and dst 8.8.8.8
# (use NAT source if applicable)

# If packets on eth1 but not eth0:
# - Firewall blocking
# - Routing issue
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Scenario 2: Asymmetric Routing&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Capture on both interfaces
sudo tcpdump -i eth0 -n host 192.168.1.100
sudo tcpdump -i eth1 -n host 192.168.1.100

# If request on eth0, response on eth1:
# - Asymmetric routing
# - Might be dropped by stateful firewall
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Scenario 3: Connection Resets&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Find who sends RST
sudo tcpdump -i eth0 -n &apos;tcp[tcpflags] &amp;amp; (tcp-rst) != 0&apos;

# If RST from destination: Port closed or application error
# If RST from middle: Firewall, IPS, or timeout
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Scenario 4: MTU Problems&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Look for fragments (MF flag set or nonzero fragment offset)
sudo tcpdump -i eth0 -n &apos;ip[6:2] &amp;amp; 0x3fff != 0&apos;

# Look for ICMP fragmentation needed
sudo tcpdump -i eth0 -n &apos;icmp[0] = 3 and icmp[1] = 4&apos;

# If you see these, MTU/MSS issue
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Advanced Techniques&lt;/h2&gt;
&lt;h3&gt;Capture Only Headers&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Capture only first 96 bytes (headers)
sudo tcpdump -i eth0 -n -s 96 -w /tmp/headers-only.pcap

# Reduces file size, still useful for analysis
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Ring Buffer for Continuous Capture&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Keep last 100MB of traffic
sudo tcpdump -i eth0 -n -w /tmp/capture.pcap -C 10 -W 10

# 10 files × 10MB = 100MB rotating buffer
# Useful for catching intermittent issues
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Trigger-Based Capture&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# Start capture when problem detected
# /config/scripts/triggered-capture.sh

while true; do
    # Check for symptom (e.g., high conntrack)
    if [ &quot;$(cat /proc/sys/net/netfilter/nf_conntrack_count)&quot; -gt 50000 ]; then
        timeout 60 tcpdump -i eth0 -n -w /tmp/triggered-$(date +%s).pcap
    fi
    sleep 10
done
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Best Practices&lt;/h2&gt;
&lt;h3&gt;1. Filter Early&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Bad: Capture everything, filter later
sudo tcpdump -i eth0 -w /tmp/huge.pcap

# Good: Filter during capture
sudo tcpdump -i eth0 -n -w /tmp/small.pcap host 192.168.1.100 and port 80
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Exclude SSH (When Connected via SSH)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Avoid capturing your own session
sudo tcpdump -i eth0 -n not port 22
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Use Names for Saved Files&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Include date, interface, purpose
sudo tcpdump -i eth0 -n host 192.168.1.100 \
  -w /tmp/eth0-192.168.1.100-$(date +%Y%m%d-%H%M%S).pcap
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4. Know When to Stop&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Set packet count limit
sudo tcpdump -i eth0 -n -c 1000 -w /tmp/capture.pcap

# Set time limit
timeout 60 sudo tcpdump -i eth0 -n -w /tmp/capture.pcap
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Packets never lie.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When troubleshooting fails with logs and commands, packet capture shows exactly what&apos;s happening:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Is the traffic arriving?&lt;/li&gt;
&lt;li&gt;Is it leaving?&lt;/li&gt;
&lt;li&gt;Is the firewall changing it?&lt;/li&gt;
&lt;li&gt;Is NAT translating it?&lt;/li&gt;
&lt;li&gt;Is the destination responding?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The capture tells you what logs cannot: the actual bytes on the wire.&lt;/p&gt;
&lt;p&gt;Every network engineer should be comfortable with tcpdump. Not Wireshark on a desktop — tcpdump on the router where the problem is.&lt;/p&gt;
&lt;p&gt;Start simple: &lt;code&gt;tcpdump -i eth0 -n host problem-ip&lt;/code&gt;. Build filters from there. Save captures for complex analysis. But start by looking at the packets.&lt;/p&gt;
&lt;p&gt;They never lie.&lt;/p&gt;
</content:encoded><category>vyos</category><category>networking</category><category>troubleshooting</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Conntrack Deep Dive: Connection Tables, Limits, and Debugging</title><link>https://ashimov.com/posts/vyos-conntrack/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-conntrack/</guid><description>Master VyOS connection tracking internals. Covers conntrack tables, tuning limits, timeout configuration, debugging full tables, and why conntrack is the invisible stateful firewall engine.</description><pubDate>Fri, 19 Dec 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Your stateful firewall silently tracks every connection. Allowing &quot;established, related&quot; traffic to return requires remembering that you initiated the connection. That&apos;s conntrack.&lt;/p&gt;
&lt;p&gt;When the conntrack table fills up, new connections fail mysteriously. No error, just dropped packets. The firewall has no room to track new connections, so it drops them.&lt;/p&gt;
&lt;p&gt;Conntrack is the invisible stateful firewall engine. When it fails, everything fails.&lt;/p&gt;
&lt;h2&gt;What Conntrack Does&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;Client → Router → Server

1. Client sends SYN to server
2. Router creates conntrack entry: NEW
3. Server responds with SYN-ACK
4. Router updates entry: ESTABLISHED
5. Traffic flows, entry tracks state
6. Connection closes
7. Entry times out, removed
&lt;/code&gt;&lt;/pre&gt;
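&lt;p&gt;You can watch this lifecycle live with the &lt;code&gt;conntrack&lt;/code&gt; tool&apos;s event mode (assuming the tool is installed):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Stream entries as they are created, updated, and destroyed
sudo conntrack -E

# Only creation/teardown events for one host
sudo conntrack -E -e NEW,DESTROY -s 192.168.1.100
&lt;/code&gt;&lt;/pre&gt;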
&lt;p&gt;Without conntrack, the firewall can&apos;t know that a packet from the server is a response to your request vs. unsolicited traffic.&lt;/p&gt;
&lt;h2&gt;Viewing Conntrack Table&lt;/h2&gt;
&lt;h3&gt;Basic Commands&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show all connections
show conntrack table

# Show IPv4 connections only
show conntrack table ipv4

# Show IPv6 connections only
show conntrack table ipv6

# Show connection count
show conntrack table statistics
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Filtering Conntrack&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show connections to specific IP
show conntrack table ipv4 | grep &quot;192.168.1.100&quot;

# Show connections by protocol
show conntrack table ipv4 | grep &quot;tcp&quot;
show conntrack table ipv4 | grep &quot;udp&quot;

# Show connections in specific state
show conntrack table ipv4 | grep &quot;ESTABLISHED&quot;
show conntrack table ipv4 | grep &quot;TIME_WAIT&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Direct Conntrack Commands&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Using conntrack tool directly
sudo conntrack -L                    # List all
sudo conntrack -L -p tcp             # TCP only
sudo conntrack -L -s 192.168.1.100   # Source IP
sudo conntrack -L -d 8.8.8.8         # Destination IP
sudo conntrack -C                    # Count entries
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Conntrack Statistics&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;# View statistics
show conntrack table statistics

# Or directly
sudo conntrack -S

# Output:
# cpu=0     found=12345 invalid=0 ignore=0 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0
# cpu=1     found=12340 invalid=2 ignore=0 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0

# Key metrics:
# drop: Packets dropped due to conntrack issues
# early_drop: Entries dropped to make room for new ones
# insert_failed: Failed to create new entry (table full)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Conntrack Table Size&lt;/h2&gt;
&lt;h3&gt;Check Current Settings&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Current maximum entries
cat /proc/sys/net/netfilter/nf_conntrack_max

# Current count
cat /proc/sys/net/netfilter/nf_conntrack_count

# Hash table size
cat /proc/sys/net/netfilter/nf_conntrack_buckets
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Increase Table Size&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Increase max connections
set system conntrack table-size 262144  # Default is often 65536

# Adjust hash table size (should be ~25% of table-size)
set system conntrack hash-size 65536

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Calculate Requirements&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Rule of thumb:
- Each connection uses ~350 bytes
- 65536 entries ≈ 22 MB RAM
- 262144 entries ≈ 90 MB RAM

# For NAT gateway with 1000 users:
# Assume 100 connections per user
# 1000 × 100 = 100,000 connections
# Set table-size to at least 150,000 (with headroom)
&lt;/code&gt;&lt;/pre&gt;
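&lt;p&gt;The same sizing math as a quick shell calculation (the 350-byte figure is an approximation and varies by kernel):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;USERS=1000; CONNS_PER_USER=100
PEAK=$((USERS * CONNS_PER_USER))
TABLE=$((PEAK * 3 / 2))                  # 50% headroom
MEM_MB=$((TABLE * 350 / 1024 / 1024))
echo &quot;peak=$PEAK table-size=$TABLE ram≈${MEM_MB}MB&quot;
# peak=100000 table-size=150000 ram≈50MB
&lt;/code&gt;&lt;/pre&gt;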
&lt;h2&gt;Conntrack Timeouts&lt;/h2&gt;
&lt;h3&gt;View Current Timeouts&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# TCP timeouts
cat /proc/sys/net/netfilter/nf_conntrack_tcp_timeout_established
cat /proc/sys/net/netfilter/nf_conntrack_tcp_timeout_time_wait
cat /proc/sys/net/netfilter/nf_conntrack_tcp_timeout_close_wait

# UDP timeouts
cat /proc/sys/net/netfilter/nf_conntrack_udp_timeout
cat /proc/sys/net/netfilter/nf_conntrack_udp_timeout_stream

# ICMP timeout
cat /proc/sys/net/netfilter/nf_conntrack_icmp_timeout
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Configure Timeouts in VyOS&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# TCP timeouts
set system conntrack timeout tcp established 7200  # 2 hours (default 5 days!)
set system conntrack timeout tcp close 10
set system conntrack timeout tcp close-wait 60
set system conntrack timeout tcp fin-wait 120
set system conntrack timeout tcp last-ack 30
set system conntrack timeout tcp syn-recv 60
set system conntrack timeout tcp syn-sent 120
set system conntrack timeout tcp time-wait 120

# UDP timeouts
set system conntrack timeout udp other 30
set system conntrack timeout udp stream 180

# ICMP timeout
set system conntrack timeout icmp 30

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Aggressive Timeouts (For Busy NAT)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# For high-traffic NAT gateways with limited table size
set system conntrack timeout tcp established 3600   # 1 hour
set system conntrack timeout tcp time-wait 30       # Clear quickly
set system conntrack timeout udp other 30
set system conntrack timeout udp stream 60

# Warning: Too aggressive can break long-lived connections
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Conntrack Problems&lt;/h2&gt;
&lt;h3&gt;Problem 1: Table Full&lt;/h3&gt;
&lt;p&gt;Symptoms:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;New connections fail randomly&lt;/li&gt;
&lt;li&gt;Established connections work&lt;/li&gt;
&lt;li&gt;Log shows &quot;nf_conntrack: table full&quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# Check table status
show conntrack table statistics

# Look for:
# early_drop &amp;gt; 0
# insert_failed &amp;gt; 0

# Check current vs max
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

# Fix: Increase size
set system conntrack table-size 524288
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Problem 2: Memory Exhaustion&lt;/h3&gt;
&lt;p&gt;Large conntrack tables consume RAM:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Calculate memory needed
# 262144 entries × 350 bytes ≈ 90 MB

# If short on memory, reduce table or add RAM
# Or reduce timeouts to expire entries faster
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Problem 3: Stale Entries&lt;/h3&gt;
&lt;p&gt;Connections closed but entries remain:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Clear specific entry
sudo conntrack -D -s 192.168.1.100 -d 8.8.8.8

# Clear all entries (dangerous in production!)
sudo conntrack -F

# Clear entries by protocol
sudo conntrack -D -p udp
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Problem 4: Conntrack Not Tracking&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Some traffic bypasses conntrack (NOTRACK)
# Check if tracking is enabled

show firewall

# Look for notrack rules:
# set firewall ipv4 name RAW rule 10 action notrack
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;NOTRACK Rules&lt;/h2&gt;
&lt;p&gt;For high-bandwidth traffic that doesn&apos;t need stateful inspection:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Skip tracking for specific traffic
set firewall ipv4 name RAW default-action accept
set firewall ipv4 name RAW rule 10 action notrack
set firewall ipv4 name RAW rule 10 destination address 224.0.0.0/4
set firewall ipv4 name RAW rule 10 description &quot;Skip tracking for multicast&quot;

# Apply to raw table
set firewall ipv4 input filter raw RAW

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;When to Use NOTRACK&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Multicast/broadcast traffic&lt;/li&gt;
&lt;li&gt;High-bandwidth services that don&apos;t need state&lt;/li&gt;
&lt;li&gt;Traffic between trusted internal segments&lt;/li&gt;
&lt;li&gt;When conntrack table is bottleneck&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Risks of NOTRACK&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;No stateful filtering for that traffic&lt;/li&gt;
&lt;li&gt;&quot;Established, related&quot; rules won&apos;t work&lt;/li&gt;
&lt;li&gt;Must use stateless rules for that traffic&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Conntrack Helpers&lt;/h2&gt;
&lt;p&gt;Conntrack helpers track multi-connection protocols (FTP, SIP):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# View loaded helper modules
lsmod | grep nf_conntrack_

configure

# Enable FTP helper
set system conntrack modules ftp

# Enable SIP helper
set system conntrack modules sip

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;FTP Active Mode Fix&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# FTP active mode requires helper to track data connection
set system conntrack modules ftp

# Helper allows firewall to recognize FTP data connections
# as related to control connection
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Monitoring Conntrack&lt;/h2&gt;
&lt;h3&gt;Watch Table Fill&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Monitor in real-time
watch -n 1 &apos;cat /proc/sys/net/netfilter/nf_conntrack_count&apos;

# Alert script
#!/bin/bash
MAX=$(cat /proc/sys/net/netfilter/nf_conntrack_max)
CURRENT=$(cat /proc/sys/net/netfilter/nf_conntrack_count)
THRESHOLD=80

USAGE=$((CURRENT * 100 / MAX))

if [ $USAGE -gt $THRESHOLD ]; then
    echo &quot;Conntrack ${USAGE}% full (${CURRENT}/${MAX})&quot;
fi
&lt;/code&gt;&lt;/pre&gt;
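&lt;p&gt;To run that check every minute, a crontab entry like this works (the script path is illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# /etc/cron.d/conntrack-alert
* * * * * root /config/scripts/conntrack-alert.sh | logger -t conntrack
&lt;/code&gt;&lt;/pre&gt;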
&lt;h3&gt;Prometheus Metrics&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Node exporter exposes conntrack metrics:
# node_nf_conntrack_entries
# node_nf_conntrack_entries_limit

# Alert when &amp;gt; 80% full
# expr: node_nf_conntrack_entries / node_nf_conntrack_entries_limit &amp;gt; 0.8
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Conntrack by Connection Type&lt;/h2&gt;
&lt;h3&gt;Heavy NAT Users&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Find IPs with most connections (TCP entries; field 5 is src=)
sudo conntrack -L -p tcp | awk &apos;{print $5}&apos; | cut -d= -f2 | sort | uniq -c | sort -rn | head

# Output:
# 5234 192.168.1.50
# 3421 192.168.1.51
# ...
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Connection State Distribution&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Count by TCP state (field 4 on tcp entries)
sudo conntrack -L -p tcp | awk &apos;{print $4}&apos; | sort | uniq -c | sort -rn

# Output:
# 45234 ESTABLISHED
#  2341 TIME_WAIT
#   123 SYN_SENT
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Best Practices&lt;/h2&gt;
&lt;h3&gt;1. Size for Peak Load&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Calculate peak connections
# Add 50% headroom
set system conntrack table-size 262144  # More is safer
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Tune Timeouts for Traffic&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Long-lived connections (database, SSH)
set system conntrack timeout tcp established 86400  # 24h

# Short-lived connections (HTTP)
set system conntrack timeout tcp established 3600   # 1h
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Monitor Always&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Alert before table fills
# Dashboard showing:
# - Current entries
# - Max entries
# - Entries/second rate
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4. NOTRACK Where Safe&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Reduce load by not tracking:
# - Internal trusted traffic
# - High-bandwidth transfers
# - Multicast
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Conntrack is the invisible stateful firewall engine. When it fails, everything fails.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Every &quot;allow established, related&quot; rule depends on conntrack. Every NAT translation depends on conntrack. Without it, no stateful firewalling.&lt;/p&gt;
&lt;p&gt;When conntrack table fills:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;New connections silently fail&lt;/li&gt;
&lt;li&gt;No error message to user&lt;/li&gt;
&lt;li&gt;Existing connections keep working&lt;/li&gt;
&lt;li&gt;Very confusing symptoms&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Prevention:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Size table for expected load + headroom&lt;/li&gt;
&lt;li&gt;Tune timeouts for your traffic patterns&lt;/li&gt;
&lt;li&gt;Monitor table usage constantly&lt;/li&gt;
&lt;li&gt;Alert before it fills, not after&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The connection table is limited. Plan for it.&lt;/p&gt;
</content:encoded><category>vyos</category><category>firewall</category><category>networking</category><category>troubleshooting</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>TCP MSS Clamping: When and Why to Adjust Segment Size</title><link>https://ashimov.com/posts/vyos-tcp-mss/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-tcp-mss/</guid><description>Master TCP MSS clamping on VyOS for tunnels and PPPoE. Covers MSS vs MTU, clamping configuration, troubleshooting fragmentation, and why MSS clamping fixes problems MTU changes cannot.</description><pubDate>Tue, 16 Dec 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;TCP connections work. Then you add a VPN. Suddenly large transfers fail while small ones work. SSH connects, but SCP stalls. Websites load headers but not content.&lt;/p&gt;
&lt;p&gt;The culprit: MTU mismatch. Your tunnel has overhead. Packets get fragmented or dropped. ICMP &quot;fragmentation needed&quot; messages get filtered. TCP never learns the path MTU.&lt;/p&gt;
&lt;p&gt;MSS clamping fixes this by telling TCP to use smaller segments in the first place. No fragmentation needed, no ICMP required.&lt;/p&gt;
&lt;h2&gt;MTU vs MSS&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;MTU&lt;/strong&gt; (Maximum Transmission Unit): Maximum IP packet size an interface can send.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Ethernet default: 1500 bytes&lt;/li&gt;
&lt;li&gt;Includes IP header (20) and TCP header (20)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;MSS&lt;/strong&gt; (Maximum Segment Size): Maximum TCP payload size.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Announced during TCP handshake&lt;/li&gt;
&lt;li&gt;MSS = MTU - 40 (IPv4) or MTU - 60 (IPv6)&lt;/li&gt;
&lt;li&gt;Ethernet default MSS: 1460 bytes (IPv4)&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;MTU 1500:
┌─────────────────────────────────────────────────┐
│ IP Header │ TCP Header │      TCP Data         │
│   20 B    │    20 B    │       1460 B          │
└─────────────────────────────────────────────────┘
                           ↑ This is MSS
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Why Tunnels Break Large Transfers&lt;/h2&gt;
&lt;p&gt;Tunnels add encapsulation overhead:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Normal Ethernet (MTU 1500):
[ IP | TCP | Data 1460 bytes ]  = 1500 bytes ✓

GRE Tunnel (24 bytes overhead):
[ Outer IP | GRE | Inner IP | TCP | Data 1460 bytes ] = 1524 bytes ✗
                                                         ↑ Exceeds MTU!
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Options:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Fragment packets (slow, can fail)&lt;/li&gt;
&lt;li&gt;Lower tunnel MTU (requires end-to-end coordination)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MSS clamping&lt;/strong&gt; (transparent, works without coordination)&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;MSS Clamping in VyOS&lt;/h2&gt;
&lt;h3&gt;Interface-Based Clamping&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Clamp MSS on specific interface
set firewall options interface eth0 adjust-mss 1360

# For tunnel interfaces
set firewall options interface tun0 adjust-mss 1360
set firewall options interface wg0 adjust-mss 1380

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Calculating Correct MSS&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Formula: MSS = Tunnel_MTU - 40 (IPv4)
# Or:      MSS = Tunnel_MTU - 60 (IPv6)

# Common scenarios:
# PPPoE (MTU 1492):    MSS = 1492 - 40 = 1452
# GRE (MTU 1476):      MSS = 1476 - 40 = 1436
# IPsec (~MTU 1400):   MSS = 1400 - 40 = 1360
# WireGuard (MTU 1420): MSS = 1420 - 40 = 1380
# VXLAN (MTU 1450):    MSS = 1450 - 40 = 1410
&lt;/code&gt;&lt;/pre&gt;
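&lt;p&gt;The formula is simple enough to script; a tiny helper (name is illustrative) that prints both address families:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;mss() { echo &quot;MTU $1 → IPv4 MSS $(($1 - 40)), IPv6 MSS $(($1 - 60))&quot;; }

mss 1492   # PPPoE:     MTU 1492 → IPv4 MSS 1452, IPv6 MSS 1432
mss 1420   # WireGuard: MTU 1420 → IPv4 MSS 1380, IPv6 MSS 1360
&lt;/code&gt;&lt;/pre&gt;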
&lt;h3&gt;Global MSS Clamping&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Apply to all interfaces (less targeted but simpler)
set firewall options all-interfaces adjust-mss 1360

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;MSS Clamping by Zone/Direction&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Clamp only for traffic leaving via tunnel
set firewall options interface tun0 adjust-mss 1360

# Clamp for traffic entering from LAN toward tunnel
set firewall options interface eth1 adjust-mss 1360  # LAN interface
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;PPPoE Configuration&lt;/h2&gt;
&lt;p&gt;PPPoE is the most common MSS clamping scenario:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# PPPoE interface setup
set interfaces ethernet eth0 pppoe 0 default-route auto
set interfaces ethernet eth0 pppoe 0 mtu 1492
set interfaces ethernet eth0 pppoe 0 name-server auto

# MSS clamping for PPPoE
set firewall options interface pppoe0 adjust-mss 1452

# Or use &apos;clamp-mss-to-pmtu&apos; to auto-calculate
set firewall options interface pppoe0 adjust-mss clamp-mss-to-pmtu

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;VPN Tunnel Configuration&lt;/h2&gt;
&lt;h3&gt;IPsec&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# IPsec has variable overhead depending on encryption
# ESP header + encryption padding: ~50-80 bytes

# Conservative MSS for IPsec
set firewall options interface vti0 adjust-mss 1360

# Or on LAN-facing interface for traffic going to VPN
set firewall options interface eth1 adjust-mss 1360
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;WireGuard&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# WireGuard overhead: 60 bytes (IPv4) or 80 bytes (IPv6)
# Default WireGuard MTU: 1420

set firewall options interface wg0 adjust-mss 1380
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;GRE&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# GRE overhead: 24 bytes (basic) to 28+ (with key/sequence)
# GRE over IPsec: even more overhead

set interfaces tunnel tun0 encapsulation gre
set interfaces tunnel tun0 mtu 1400
set firewall options interface tun0 adjust-mss 1360
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Troubleshooting MSS Issues&lt;/h2&gt;
&lt;h3&gt;Symptom: Large Transfers Fail&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Small files/pings work
ping -s 64 remote-host    # Works

# Large packets with the DF bit set fail or hang
ping -M do -s 1400 remote-host  # Fails or hangs

# Solution: Add MSS clamping
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Detecting Current MSS&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Capture TCP SYN packets to see advertised MSS
sudo tcpdump -i eth0 -n -v &apos;tcp[tcpflags] &amp;amp; (tcp-syn) != 0&apos;

# Look for &quot;mss 1460&quot; or similar in output
# 14:32:15 IP host.port &amp;gt; dest.port: Flags [S], ... mss 1460 ...
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Verify Clamping is Working&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Before clamping:
# Client sends: mss 1460
# After clamping to 1360:
# Router modifies: mss 1360

# Capture on LAN side
sudo tcpdump -i eth1 -n -v &apos;tcp[tcpflags] &amp;amp; (tcp-syn) != 0&apos;

# Capture on tunnel side
sudo tcpdump -i tun0 -n -v &apos;tcp[tcpflags] &amp;amp; (tcp-syn) != 0&apos;

# Compare MSS values
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Test Path MTU&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Find actual path MTU
tracepath remote-host

# Manual test with DF bit
ping -M do -s 1372 remote-host  # payload 1372 + 28 header bytes = 1400; adjust until it works
&lt;/code&gt;&lt;/pre&gt;
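&lt;p&gt;If &lt;code&gt;tracepath&lt;/code&gt; isn&apos;t available, the manual probing can be automated. A sketch that binary-searches for the largest DF-bit payload that gets through (Linux &lt;code&gt;ping&lt;/code&gt; options; the host name is a placeholder):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
HOST=remote-host
lo=1200; hi=1472                 # ICMP payload sizes to probe
while [ $lo -lt $hi ]; do
    mid=$(( (lo + hi + 1) / 2 ))
    if ping -c 1 -W 1 -M do -s $mid $HOST &amp;gt;/dev/null 2&amp;gt;&amp;amp;1; then
        lo=$mid                  # fits, try larger
    else
        hi=$((mid - 1))          # too big, shrink
    fi
done
echo &quot;Path MTU ≈ $((lo + 28)) bytes (payload $lo + 28 header)&quot;
&lt;/code&gt;&lt;/pre&gt;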
&lt;h2&gt;Common Scenarios&lt;/h2&gt;
&lt;h3&gt;Scenario 1: Site-to-Site VPN Users Can&apos;t Access Some Sites&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Problem: VPN tunnel MTU is 1400
         Client sends MSS 1460
         Large packets can&apos;t traverse tunnel

Solution:
set firewall options interface vti0 adjust-mss 1360
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Scenario 2: PPPoE Users Have Random Website Issues&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Problem: PPPoE MTU 1492
         Some sites have PMTUD blackhole
         They never learn about lower MTU

Solution:
set firewall options interface pppoe0 adjust-mss 1452
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Scenario 3: GRE Tunnel Works for Ping, Not SCP&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Problem: GRE overhead not accounted for
         Large SSH/SCP packets fragmented/dropped

Solution:
set interfaces tunnel tun0 mtu 1400
set firewall options interface tun0 adjust-mss 1360
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Scenario 4: Double-Tunnel (GRE over IPsec)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Problem: Outer tunnel already reduces MTU
         Inner tunnel reduces it more
         Need very low MSS

Solution:
# Outer IPsec: MTU ~1400
# GRE inside: MTU 1400 - 24 = 1376
# MSS: 1376 - 40 = 1336

set firewall options interface tun0 adjust-mss 1336
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Advanced Configuration&lt;/h2&gt;
&lt;h3&gt;IPv6 MSS Clamping&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# IPv6 header is 40 bytes (vs 20 for IPv4)
# MSS = MTU - 60

set firewall options interface eth0 adjust-mss6 1340

# Or combined for both protocols
set firewall options interface eth0 adjust-mss 1360
set firewall options interface eth0 adjust-mss6 1340
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Asymmetric Clamping&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Different MSS for different directions
# Not directly supported, but can use firewall zones

# Traffic entering from Internet, leaving to tunnel
set firewall options interface eth0 adjust-mss 1360

# Traffic entering from tunnel, leaving to LAN
# (usually not needed - responses use same MSS)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Clamping with NAT&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# MSS clamping works with NAT
# Apply before or after NAT (usually doesn&apos;t matter)

set nat source rule 100 outbound-interface eth0
set nat source rule 100 translation address masquerade

set firewall options interface eth0 adjust-mss 1360

# Both NAT and MSS modification happen
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Why Not Just Lower MTU?&lt;/h2&gt;
&lt;p&gt;You could lower the MTU instead of MSS clamping:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Option A: Lower MTU on all devices
set interfaces ethernet eth0 mtu 1400
# Requires changing MTU on ALL hosts in network
# DHCP can help but not all clients respect it

# Option B: MSS clamping
set firewall options interface eth0 adjust-mss 1360
# Only affects TCP
# Transparent to endpoints
# No client changes needed
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;MSS clamping advantages:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Transparent to endpoints&lt;/li&gt;
&lt;li&gt;No client configuration needed&lt;/li&gt;
&lt;li&gt;Only affects problematic TCP path&lt;/li&gt;
&lt;li&gt;Works even when you don&apos;t control endpoints&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;MTU change advantages:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Affects all protocols (UDP, etc.)&lt;/li&gt;
&lt;li&gt;No packet modification needed&lt;/li&gt;
&lt;li&gt;More &quot;correct&quot; solution&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Best practice:&lt;/strong&gt; Use MSS clamping for TCP-heavy tunnels. Lower MTU for UDP-heavy applications or when you control all devices.&lt;/p&gt;
&lt;h2&gt;Quick Reference&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tunnel Type&lt;/th&gt;
&lt;th&gt;Overhead&lt;/th&gt;
&lt;th&gt;Safe MTU&lt;/th&gt;
&lt;th&gt;Safe MSS (IPv4)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PPPoE&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;1492&lt;/td&gt;
&lt;td&gt;1452&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GRE&lt;/td&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;td&gt;1476&lt;/td&gt;
&lt;td&gt;1436&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IPsec ESP&lt;/td&gt;
&lt;td&gt;50-80&lt;/td&gt;
&lt;td&gt;1400&lt;/td&gt;
&lt;td&gt;1360&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WireGuard&lt;/td&gt;
&lt;td&gt;60-80&lt;/td&gt;
&lt;td&gt;1420&lt;/td&gt;
&lt;td&gt;1380&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VXLAN&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;1450&lt;/td&gt;
&lt;td&gt;1410&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;L2TP&lt;/td&gt;
&lt;td&gt;40&lt;/td&gt;
&lt;td&gt;1460&lt;/td&gt;
&lt;td&gt;1420&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
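&lt;p&gt;Every row follows from two constants: 20 bytes of IPv4 header and 20 bytes of TCP header. A quick check of the arithmetic (WireGuard shown at its worst-case 80-byte overhead, which is why its default MTU is 1420):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Safe MTU = 1500 - tunnel overhead
# Safe MSS = Safe MTU - 40       (20 IPv4 + 20 TCP)

# GRE:       1500 - 24 = 1476;   1476 - 40 = 1436
# WireGuard: 1500 - 80 = 1420;   1420 - 40 = 1380
# VXLAN:     1500 - 50 = 1450;   1450 - 40 = 1410

# For IPv6, subtract 60 instead  (40 IPv6 + 20 TCP)
&lt;/code&gt;&lt;/pre&gt;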
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;MSS clamping fixes problems MTU changes cannot.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When you don&apos;t control the endpoints, can&apos;t guarantee ICMP reaches them, and can&apos;t change their MTU — MSS clamping is your only option.&lt;/p&gt;
&lt;p&gt;The router intercepts TCP handshakes and modifies the MSS value. Endpoints never know it happened. They just use smaller segments that fit through your tunnel.&lt;/p&gt;
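&lt;p&gt;You can watch the rewrite happen. Capturing SYN packets on the router shows the MSS option before and after clamping (interface name and values here are examples; on VyOS this runs from the shell via sudo):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Show TCP SYNs and their options on the tunnel-facing interface
sudo tcpdump -ni eth0 &apos;tcp[tcpflags] &amp;amp; tcp-syn != 0&apos;

# Forwarded SYNs should show the clamped value, e.g. &quot;mss 1360&quot;
# The same SYN on the LAN side still carries the original &quot;mss 1460&quot;
&lt;/code&gt;&lt;/pre&gt;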
&lt;p&gt;Every tunnel should have MSS clamping configured. It costs nothing when not needed and saves hours of troubleshooting when it is.&lt;/p&gt;
&lt;p&gt;The symptoms are always vague: &quot;Large files fail, small ones work.&quot; The fix is always the same: calculate correct MSS, clamp it, done.&lt;/p&gt;
</content:encoded><category>vyos</category><category>networking</category><category>troubleshooting</category><category>vpn</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>MTR, Tracepath, and PMTUD: Diagnosing Path Problems</title><link>https://ashimov.com/posts/vyos-mtr-pmtud/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-mtr-pmtud/</guid><description>Master network path diagnostics on VyOS. Covers MTR interpretation, traceroute variants, PMTUD troubleshooting, detecting packet loss patterns, and why ping alone is never enough.</description><pubDate>Fri, 12 Dec 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&quot;Can you ping it?&quot; Yes. &quot;Then why isn&apos;t it working?&quot; Because ping tests ICMP, not your application. Because one successful ping doesn&apos;t show the packet loss happening every 30 seconds. Because ping doesn&apos;t show which hop is the problem.&lt;/p&gt;
&lt;p&gt;Ping is a smoke test, not a diagnostic. Real troubleshooting needs tools that show the path, measure loss over time, and identify exactly where problems occur.&lt;/p&gt;
&lt;p&gt;Ping alone is never enough.&lt;/p&gt;
&lt;h2&gt;Understanding the Tools&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;What it shows&lt;/th&gt;
&lt;th&gt;When to use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ping&lt;/td&gt;
&lt;td&gt;Basic reachability&lt;/td&gt;
&lt;td&gt;Quick test&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;traceroute&lt;/td&gt;
&lt;td&gt;Path to destination&lt;/td&gt;
&lt;td&gt;Find route&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mtr&lt;/td&gt;
&lt;td&gt;Path + statistics over time&lt;/td&gt;
&lt;td&gt;Find where loss occurs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;tracepath&lt;/td&gt;
&lt;td&gt;Path + MTU discovery&lt;/td&gt;
&lt;td&gt;Find MTU issues&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;MTR Deep Dive&lt;/h2&gt;
&lt;p&gt;MTR combines traceroute with continuous ping statistics.&lt;/p&gt;
&lt;h3&gt;Basic MTR&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# From VyOS operational mode
mtr 8.8.8.8

# Output:
# Host                   Loss%   Snt   Last   Avg  Best  Wrst StDev
# 1. gateway.local        0.0%   100    1.2   1.5   0.8   3.2   0.5
# 2. isp-router.net       0.0%   100    8.3   9.1   7.2  15.3   1.8
# 3. core-router.isp      0.5%   100   12.1  13.2  11.0  45.2   5.2
# 4. ???
# 5. google-peer.net      0.0%   100   15.3  16.1  14.2  22.1   1.2
# 6. 8.8.8.8              0.0%   100   14.8  15.5  14.0  21.3   1.1
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;MTR Options&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Specify count
mtr -c 100 8.8.8.8

# Report mode (non-interactive)
mtr -r 8.8.8.8

# Wide report (show full hostnames)
mtr -rw 8.8.8.8

# Use TCP instead of ICMP
mtr -T -P 443 8.8.8.8

# Use UDP
mtr -u 8.8.8.8

# Set packet size
mtr -s 1400 8.8.8.8

# No DNS resolution (faster)
mtr -n 8.8.8.8

# Show AS numbers
mtr -z 8.8.8.8
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Interpreting MTR Output&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Host                   Loss%   Snt   Last   Avg  Best  Wrst StDev
1. 192.168.1.1          0.0%   100    1.2   1.5   0.8   3.2   0.5
2. 10.0.0.1            15.0%   100    8.3   9.1   7.2  15.3   1.8  ← Problem here?
3. 172.16.0.1          15.0%   100   12.1  13.2  11.0  45.2   5.2  ← Or here?
4. 8.8.8.8             0.0%    100   14.8  15.5  14.0  21.3   1.1  ← Destination OK
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Key insight&lt;/strong&gt;: loss that appears at hops 2 and 3 but clears by the destination means those routers are &lt;strong&gt;rate-limiting ICMP replies&lt;/strong&gt;, not dropping transit traffic. Real forwarding loss persists all the way to the destination.&lt;/p&gt;
&lt;h3&gt;Reading Loss Patterns&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Pattern 1: Real loss
Hop 3:  15% loss
Hop 4:  15% loss
Hop 5:  15% loss  ← Loss persists to destination = real problem at hop 3

# Pattern 2: ICMP rate limiting
Hop 3:  15% loss
Hop 4:  15% loss
Hop 5:   0% loss  ← Clears at destination = hop 3 rate-limits ICMP, not real loss

# Pattern 3: Asymmetric routing
Hop 3:  high latency spike
Hop 4:  normal
Hop 5:  normal   ← Return path different, not a problem
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;MTR for Different Protocols&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# ICMP might be filtered, try TCP
mtr -T -P 80 example.com

# Test actual service port
mtr -T -P 443 example.com
mtr -T -P 22 example.com

# UDP services
mtr -u -P 53 8.8.8.8
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Traceroute Variants&lt;/h2&gt;
&lt;h3&gt;Standard Traceroute&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Default traceroute (UDP probes on Linux)
traceroute 8.8.8.8

# ICMP traceroute (same protocol as ping)
traceroute -I 8.8.8.8

# TCP traceroute (bypasses some filters)
traceroute -T -p 443 8.8.8.8

# Don&apos;t resolve hostnames
traceroute -n 8.8.8.8
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Traceroute Options&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Set max hops
traceroute -m 30 8.8.8.8

# Set packet size
traceroute 8.8.8.8 1400

# Wait time per probe
traceroute -w 2 8.8.8.8

# Probes per hop
traceroute -q 5 8.8.8.8

# Source interface
traceroute -i eth0 8.8.8.8
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Interpreting Traceroute&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Normal output
1  192.168.1.1 (192.168.1.1)  1.234 ms  1.123 ms  1.345 ms
2  10.0.0.1 (10.0.0.1)  8.234 ms  8.123 ms  8.345 ms
3  * * *                        ← Hop doesn&apos;t respond
4  8.8.8.8 (8.8.8.8)  15.234 ms  15.123 ms  15.345 ms

# Stars (*) don&apos;t always mean problem
# Many routers don&apos;t respond to traceroute probes
# Final destination matters most
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Tracepath and PMTUD&lt;/h2&gt;
&lt;p&gt;Path MTU Discovery finds the maximum packet size that can traverse a path without fragmentation.&lt;/p&gt;
&lt;h3&gt;Using Tracepath&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Tracepath discovers MTU along path
tracepath 8.8.8.8

# Output includes MTU at each hop
# 1:  192.168.1.1         1.234ms pmtu 1500
# 2:  10.0.0.1            8.234ms pmtu 1500
# 3:  tunnel-endpoint    12.234ms pmtu 1400  ← MTU drops here
# 4:  8.8.8.8            15.234ms reached
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Manual PMTUD&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Find MTU using ping with DF flag
# Start with 1500, decrease until works

# Linux/VyOS
ping -M do -s 1472 8.8.8.8  # 1472 + 28 = 1500

# If fails, decrease size
ping -M do -s 1400 8.8.8.8
ping -M do -s 1372 8.8.8.8  # For tunnels with 1400 MTU
&lt;/code&gt;&lt;/pre&gt;
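&lt;p&gt;Stepping through sizes by hand gets tedious. A short loop walks the payload size down until a DF-flagged probe gets through (a rough sketch; the destination and size list are examples):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# Probe decreasing payload sizes until one passes with DF set
DEST=8.8.8.8
for SIZE in 1472 1452 1436 1412 1392 1372; do
    if ping -M do -c 1 -W 1 -s $SIZE $DEST &amp;gt;/dev/null 2&amp;gt;&amp;amp;1; then
        echo &quot;Path MTU is at least $((SIZE + 28))&quot;
        break
    fi
done
&lt;/code&gt;&lt;/pre&gt;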
&lt;h3&gt;Common MTU Values&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;MTU&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Ethernet&lt;/td&gt;
&lt;td&gt;1500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PPPoE&lt;/td&gt;
&lt;td&gt;1492&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GRE tunnel&lt;/td&gt;
&lt;td&gt;1476&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IPsec (AES)&lt;/td&gt;
&lt;td&gt;~1400&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WireGuard&lt;/td&gt;
&lt;td&gt;1420&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VXLAN&lt;/td&gt;
&lt;td&gt;1450&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;PMTUD Problems&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Symptoms of MTU problems:
# - Small packets work, large fail
# - SSH works, SCP hangs
# - Web pages partially load
# - TLS handshake fails

# Diagnose:
tracepath problematic-host.com

# Fix on VyOS:
# Option 1: Lower interface MTU
set interfaces ethernet eth0 mtu 1400

# Option 2: MSS clamping (better for VPN)
set firewall options interface eth0 adjust-mss 1360
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Diagnosing Common Problems&lt;/h2&gt;
&lt;h3&gt;Problem 1: Intermittent Packet Loss&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Run MTR for extended period
mtr -c 1000 destination.com

# Look for:
# - Consistent loss at specific hop
# - Loss that varies with time
# - Loss only during certain hours

# If loss at hop N continues to destination:
# Problem is at hop N or before
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Problem 2: High Latency Spikes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check MTR StDev column
# High StDev = inconsistent latency

# Possible causes:
# - Congestion (check time of day)
# - Buffer bloat (test with different packet sizes)
# - Routing changes (check Wrst column)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Problem 3: Path Changes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Run traceroute multiple times
for i in {1..10}; do traceroute -n 8.8.8.8; sleep 60; done

# Compare paths
# Different paths = load balancing or instability
&lt;/code&gt;&lt;/pre&gt;
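&lt;p&gt;Eyeballing ten traceroutes is error-prone. Saving just the hop addresses and diffing two runs makes path changes obvious (a minimal sketch):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Record hop IPs from two runs, then compare
traceroute -n 8.8.8.8 | tail -n +2 | awk &apos;{print $2}&apos; &amp;gt; /tmp/path1
sleep 300
traceroute -n 8.8.8.8 | tail -n +2 | awk &apos;{print $2}&apos; &amp;gt; /tmp/path2

diff /tmp/path1 /tmp/path2 &amp;amp;&amp;amp; echo &quot;Path unchanged&quot;
&lt;/code&gt;&lt;/pre&gt;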
&lt;h3&gt;Problem 4: Blackhole&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Traceroute stops at specific hop, no destination
1  192.168.1.1
2  10.0.0.1
3  172.16.0.1
4  * * *
5  * * *

# Possible causes:
# - Firewall blocking
# - Routing problem (no return path)
# - MTU blackhole (try smaller packets)

# Test with different methods:
traceroute -T -p 80 destination   # TCP
traceroute -I destination         # ICMP
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;VyOS Specific Diagnostics&lt;/h2&gt;
&lt;h3&gt;Check Local Routing&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Verify route to destination
show ip route 8.8.8.8

# Check for multiple paths
show ip route 8.8.8.8 longer-prefixes

# BGP-specific path info
show ip bgp 8.8.8.8
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Check Interface Stats&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Look for errors
show interfaces ethernet eth0

# Key metrics:
# - RX/TX errors
# - RX/TX drops
# - Collisions (shouldn&apos;t happen on modern networks)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Check Firewall Drops&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Enable logging on drop rules
set firewall ipv4 name WAN-IN rule 999 log enable

# Check logs
show log | grep DROP

# Might reveal blocked traffic
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Scripting Diagnostics&lt;/h2&gt;
&lt;h3&gt;Continuous Monitoring Script&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# /config/scripts/path-monitor.sh

DESTINATION=$1
LOG_FILE=&quot;/var/log/path-monitor.log&quot;

while true; do
    echo &quot;=== $(date) ===&quot; &amp;gt;&amp;gt; $LOG_FILE
    mtr -r -c 10 $DESTINATION &amp;gt;&amp;gt; $LOG_FILE
    sleep 300
done
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Alert on High Loss&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# /config/scripts/check-loss.sh

DESTINATION=&quot;8.8.8.8&quot;
THRESHOLD=5

LOSS=$(mtr -r -c 100 &quot;$DESTINATION&quot; | tail -1 | awk &apos;{print $3}&apos; | sed &apos;s/%//&apos;)

if [ -n &quot;$LOSS&quot; ] &amp;amp;&amp;amp; [ &quot;$(echo &quot;$LOSS &amp;gt; $THRESHOLD&quot; | bc)&quot; -eq 1 ]; then
    echo &quot;High loss detected: ${LOSS}%&quot; | mail -s &quot;Network Alert&quot; admin@example.com
fi
&lt;/code&gt;&lt;/pre&gt;
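&lt;p&gt;Instead of running the check by hand, VyOS can schedule it with the built-in task scheduler (the interval and script path here are examples):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure
set system task-scheduler task check-loss executable path /config/scripts/check-loss.sh
set system task-scheduler task check-loss interval 15m
commit
&lt;/code&gt;&lt;/pre&gt;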
&lt;h2&gt;Best Practices&lt;/h2&gt;
&lt;h3&gt;1. Always Test Bidirectionally&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Problem might be in return path
# Test from both ends when possible

# From VyOS:
mtr remote-host

# From remote host:
mtr vyos-router
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Test the Actual Service&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# ICMP might work when TCP doesn&apos;t
mtr -T -P 443 website.com
mtr -T -P 22 server.com
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Collect Data Over Time&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# One-time test might miss intermittent issues
# Run extended tests during problem periods
mtr -c 500 problematic-host.com
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4. Document Baseline&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Know what &quot;normal&quot; looks like
# Run MTR when everything works
# Compare during problems
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Ping alone is never enough.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Ping tells you: &quot;Something responded to ICMP.&quot;&lt;/p&gt;
&lt;p&gt;MTR tells you:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Which hops are on the path&lt;/li&gt;
&lt;li&gt;Where packet loss occurs&lt;/li&gt;
&lt;li&gt;Whether loss is real or ICMP rate-limiting&lt;/li&gt;
&lt;li&gt;Latency variations and patterns&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When troubleshooting:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Start with MTR, not ping&lt;/li&gt;
&lt;li&gt;Run long enough to catch patterns&lt;/li&gt;
&lt;li&gt;Test with relevant protocol (TCP/UDP)&lt;/li&gt;
&lt;li&gt;Test bidirectionally&lt;/li&gt;
&lt;li&gt;Compare to known baseline&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The path is usually the problem, not the endpoint. MTR shows you the path.&lt;/p&gt;
</content:encoded><category>vyos</category><category>networking</category><category>troubleshooting</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>RADIUS and TACACS+: Centralized Authentication for Network Devices</title><link>https://ashimov.com/posts/vyos-aaa/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-aaa/</guid><description>Configure VyOS with RADIUS and TACACS+ for centralized AAA. Covers server setup, failover configuration, command authorization, accounting, and why central auth is non-negotiable at scale.</description><pubDate>Tue, 09 Dec 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Managing users on 50 routers means changing passwords in 50 places. Someone leaves the company — 50 deletions. New hire — 50 accounts to create. Password policy change — 50 updates.&lt;/p&gt;
&lt;p&gt;RADIUS and TACACS+ solve this. Users authenticate against a central server. Create once, authenticate everywhere. Revoke once, locked out everywhere.&lt;/p&gt;
&lt;p&gt;At scale, central authentication is non-negotiable.&lt;/p&gt;
&lt;h2&gt;AAA Concepts&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Authentication&lt;/strong&gt;: Who are you? (username/password, keys)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Authorization&lt;/strong&gt;: What can you do? (privilege levels, commands)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Accounting&lt;/strong&gt;: What did you do? (logging, audit)&lt;/li&gt;
&lt;/ul&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;RADIUS&lt;/th&gt;
&lt;th&gt;TACACS+&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Port&lt;/td&gt;
&lt;td&gt;UDP 1812/1813&lt;/td&gt;
&lt;td&gt;TCP 49&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Encryption&lt;/td&gt;
&lt;td&gt;Password only&lt;/td&gt;
&lt;td&gt;Full packet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Authorization&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Per-command&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best for&lt;/td&gt;
&lt;td&gt;Network access&lt;/td&gt;
&lt;td&gt;Device management&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For router management, TACACS+ is generally preferred because it supports per-command authorization.&lt;/p&gt;
&lt;h2&gt;RADIUS Configuration&lt;/h2&gt;
&lt;h3&gt;Basic RADIUS Setup&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Add RADIUS server
set system login radius server 10.0.0.100 key &quot;RadiusSecretKey123&quot;
set system login radius server 10.0.0.100 port 1812

# Optional: Set timeout and retries
set system login radius server 10.0.0.100 timeout 3

# Set the source address used for RADIUS requests
set system login radius source-address 192.168.1.1

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Multiple RADIUS Servers&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Primary server
set system login radius server 10.0.0.100 key &quot;RadiusKey&quot;
set system login radius server 10.0.0.100 priority 1

# Backup server
set system login radius server 10.0.0.101 key &quot;RadiusKey&quot;
set system login radius server 10.0.0.101 priority 2

# Third server
set system login radius server 10.0.0.102 key &quot;RadiusKey&quot;
set system login radius server 10.0.0.102 priority 3

# VyOS tries servers in priority order
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;RADIUS with Local Fallback&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# If RADIUS fails, fall back to local accounts
# Always keep at least one local admin account!

set system login user local-admin full-name &quot;Emergency Local Admin&quot;
set system login user local-admin authentication public-keys emergency key &quot;...&quot;
set system login user local-admin authentication public-keys emergency type ssh-ed25519

# Local accounts are tried after RADIUS fails
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;TACACS+ Configuration&lt;/h2&gt;
&lt;h3&gt;Basic TACACS+ Setup&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Add TACACS+ server
set system login tacacs server 10.0.0.100 key &quot;TacacsSecretKey456&quot;
set system login tacacs server 10.0.0.100 port 49

# Set source address
set system login tacacs source-address 192.168.1.1

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Multiple TACACS+ Servers&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Primary
set system login tacacs server 10.0.0.100 key &quot;TacacsKey&quot;
set system login tacacs server 10.0.0.100 priority 1

# Backup
set system login tacacs server 10.0.0.101 key &quot;TacacsKey&quot;
set system login tacacs server 10.0.0.101 priority 2

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;TACACS+ Timeout&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Adjust timeout (default is usually fine)
set system login tacacs server 10.0.0.100 timeout 5
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;FreeRADIUS Server Setup&lt;/h2&gt;
&lt;h3&gt;Install FreeRADIUS (Ubuntu/Debian)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;apt update
apt install freeradius freeradius-utils
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Configure Client (VyOS Router)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# /etc/freeradius/3.0/clients.conf

client vyos-router {
    ipaddr = 192.168.1.1
    secret = RadiusSecretKey123
    shortname = vyos-main
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Configure Users&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# /etc/freeradius/3.0/users

# Admin user
admin-user Cleartext-Password := &quot;AdminPassword123&quot;
    Service-Type = Administrative-User,
    Cisco-AVPair = &quot;shell:priv-lvl=15&quot;

# Operator user
operator-user Cleartext-Password := &quot;OperatorPassword456&quot;
    Service-Type = NAS-Prompt-User,
    Cisco-AVPair = &quot;shell:priv-lvl=1&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Start FreeRADIUS&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Test configuration
freeradius -X  # Debug mode

# Start service
systemctl enable freeradius
systemctl start freeradius

# Test authentication
radtest admin-user AdminPassword123 localhost 0 testing123
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;TACACS+ Server Setup (tac_plus)&lt;/h2&gt;
&lt;h3&gt;Install tac_plus&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;apt install tacacs+
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Configure tac_plus&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# /etc/tacacs+/tac_plus.conf

key = &quot;TacacsSecretKey456&quot;

accounting file = /var/log/tac_plus.acct

user = admin-user {
    member = admins
    login = cleartext &quot;AdminPassword123&quot;
}

user = operator-user {
    member = operators
    login = cleartext &quot;OperatorPassword456&quot;
}

group = admins {
    default service = permit
    service = exec {
        priv-lvl = 15
    }
}

group = operators {
    default service = deny
    service = exec {
        priv-lvl = 1
    }
    cmd = show {
        permit .*
    }
    cmd = ping {
        permit .*
    }
    cmd = traceroute {
        permit .*
    }
}
&lt;/code&gt;&lt;/pre&gt;
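&lt;p&gt;Before restarting the daemon, check that the file actually parses. The shrubbery tac_plus build has a parse-only mode; verify the flags against your packaged version:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Parse the configuration and exit without serving requests
tac_plus -P -C /etc/tacacs+/tac_plus.conf
&lt;/code&gt;&lt;/pre&gt;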
&lt;h3&gt;Start tac_plus&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;systemctl enable tacacs+
systemctl start tacacs+
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;VyOS User Levels via AAA&lt;/h2&gt;
&lt;h3&gt;RADIUS Attributes for VyOS&lt;/h3&gt;
&lt;p&gt;VyOS uses standard RADIUS attributes. To set privilege level:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# In FreeRADIUS users file
admin-user Cleartext-Password := &quot;password&quot;
    Service-Type = Administrative-User
# Maps to admin level

operator-user Cleartext-Password := &quot;password&quot;
    Service-Type = NAS-Prompt-User
# Maps to operator level
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;TACACS+ Privilege Levels&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# In tac_plus.conf
service = exec {
    priv-lvl = 15  # Admin access
}

service = exec {
    priv-lvl = 1   # Operator access
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Testing Authentication&lt;/h2&gt;
&lt;h3&gt;Test from VyOS&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Try to SSH with RADIUS/TACACS user
ssh admin-user@vyos-router

# Check logs
show log | grep -i radius
show log | grep -i tacacs
show log | grep -i auth
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Test from Server&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# RADIUS test
radtest admin-user AdminPassword123 localhost 0 testing123

# TACACS+ test (requires test tool)
# Connect to VyOS and try login
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Debug Authentication Issues&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# On VyOS, check logs
show log | grep -i pam
show log | grep -i auth

# On RADIUS server (debug mode)
freeradius -X

# Common issues:
# - Wrong shared secret
# - Firewall blocking ports
# - Source address mismatch
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Accounting Configuration&lt;/h2&gt;
&lt;h3&gt;RADIUS Accounting&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# RADIUS accounting sends session start/stop records
# Usually configured on RADIUS server side

# Check accounting logs on server
cat /var/log/radius/radacct/*/detail-*
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;TACACS+ Accounting&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# tac_plus logs commands
# Accounting file location in tac_plus.conf:
accounting file = /var/log/tac_plus.acct

# View accounting log
tail -f /var/log/tac_plus.acct
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;High Availability Setup&lt;/h2&gt;
&lt;h3&gt;Primary/Backup with Health Check&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Configure multiple servers
set system login radius server 10.0.0.100 priority 1
set system login radius server 10.0.0.101 priority 2

# VyOS automatically fails over if primary unavailable
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Geographic Distribution&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Datacenter 1
set system login radius server 10.1.0.100 priority 1

# Datacenter 2
set system login radius server 10.2.0.100 priority 2

# Local cache doesn&apos;t exist - ensure server availability
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Local Fallback (Critical)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# ALWAYS keep local emergency account
set system login user emergency-admin authentication public-keys emergency key &quot;...&quot;
set system login user emergency-admin authentication public-keys emergency type ssh-ed25519
set system login user emergency-admin level admin

# If ALL RADIUS/TACACS servers fail, local accounts work
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Integration with LDAP/AD&lt;/h2&gt;
&lt;p&gt;RADIUS/TACACS+ can proxy to LDAP/Active Directory:&lt;/p&gt;
&lt;h3&gt;FreeRADIUS with LDAP&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# /etc/freeradius/3.0/mods-enabled/ldap

ldap {
    server = &apos;ldap.example.com&apos;
    port = 389
    identity = &apos;cn=radius,dc=example,dc=com&apos;
    password = ldap_password
    base_dn = &apos;dc=example,dc=com&apos;

    user {
        base_dn = &quot;ou=users,${..base_dn}&quot;
        filter = &quot;(uid=%{%{Stripped-User-Name}:-%{User-Name}})&quot;
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Group-Based Access&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Map LDAP groups to VyOS levels
# In FreeRADIUS:

DEFAULT Ldap-Group == &quot;network-admins&quot;
    Service-Type = Administrative-User

DEFAULT Ldap-Group == &quot;network-operators&quot;
    Service-Type = NAS-Prompt-User
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Security Best Practices&lt;/h2&gt;
&lt;h3&gt;Secure Shared Secrets&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Use strong secrets (32+ characters)
# Different secret per device (ideally)
# Store secrets in vault, not text files
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Network Security&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# TACACS+ port
set firewall ipv4 name MGMT-OUT rule 100 action accept
set firewall ipv4 name MGMT-OUT rule 100 destination port 49
set firewall ipv4 name MGMT-OUT rule 100 destination address 10.0.0.100
set firewall ipv4 name MGMT-OUT rule 100 protocol tcp

# RADIUS ports
set firewall ipv4 name MGMT-OUT rule 110 action accept
set firewall ipv4 name MGMT-OUT rule 110 destination port 1812-1813
set firewall ipv4 name MGMT-OUT rule 110 destination address 10.0.0.100
set firewall ipv4 name MGMT-OUT rule 110 protocol udp
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Encrypt Traffic&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# TACACS+ encrypts full packet (preferred)
# RADIUS only encrypts password (use with caution over untrusted networks)

# Consider:
# - VPN between router and AAA server
# - Dedicated management network
# - IPsec protected links
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Troubleshooting&lt;/h2&gt;
&lt;h3&gt;Authentication Fails&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# 1. Verify connectivity
ping 10.0.0.100

# 2. Check ports
nc -zv 10.0.0.100 49      # TACACS+
nc -zvu 10.0.0.100 1812   # RADIUS (UDP probe is best-effort; no reply does not prove open)

# 3. Check VyOS logs
show log | grep -i auth

# 4. Check server logs
# RADIUS: /var/log/freeradius/radius.log
# TACACS+: /var/log/tac_plus.log

# 5. Test locally on server
radtest user pass localhost 0 testing123
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Server Unreachable&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check source address configuration
show configuration commands | grep source-address

# Verify routing
show ip route 10.0.0.100

# Check firewall rules
show firewall
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;At scale, central authentication is non-negotiable.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;With local accounts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Employee leaves → update 50 routers&lt;/li&gt;
&lt;li&gt;Password breach → rotate on 50 routers&lt;/li&gt;
&lt;li&gt;New hire → create on 50 routers&lt;/li&gt;
&lt;li&gt;Audit → check 50 routers&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With central AAA:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Employee leaves → disable one account&lt;/li&gt;
&lt;li&gt;Password breach → one place to update&lt;/li&gt;
&lt;li&gt;New hire → one account creation&lt;/li&gt;
&lt;li&gt;Audit → one central log&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The setup takes a few hours. The ongoing management saves hundreds of hours per year.&lt;/p&gt;
&lt;p&gt;Requirements:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Redundancy&lt;/strong&gt;: Multiple AAA servers&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fallback&lt;/strong&gt;: Local emergency account always&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Logging&lt;/strong&gt;: Central accounting for audit&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Security&lt;/strong&gt;: Encrypted protocols, strong secrets&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Don&apos;t run production network devices with only local accounts. Central authentication is infrastructure, not luxury.&lt;/p&gt;
</content:encoded><category>vyos</category><category>security</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>User Management: Local Users, SSH Keys, and Access Control</title><link>https://ashimov.com/posts/vyos-users/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-users/</guid><description>Configure VyOS user management properly. Covers local user creation, SSH key authentication, privilege levels, password policies, and why root password should be disabled.</description><pubDate>Fri, 05 Dec 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Default VyOS has one user: &lt;code&gt;vyos&lt;/code&gt; with password &lt;code&gt;vyos&lt;/code&gt;. If that&apos;s still your production setup, you have a security problem. Every scanning bot knows those credentials.&lt;/p&gt;
&lt;p&gt;Proper user management means: individual accounts, SSH keys instead of passwords, privilege separation, and audit trails. When something breaks, you need to know who touched what.&lt;/p&gt;
&lt;p&gt;Shared accounts are an audit nightmare. Individual accounts with SSH keys are the baseline.&lt;/p&gt;
&lt;h2&gt;Default User Problem&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;# Default credentials
Username: vyos
Password: vyos

# Every automated scanner knows this
# First thing to change on new installation
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Creating Users&lt;/h2&gt;
&lt;h3&gt;Basic User Creation&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Create admin user with full privileges
set system login user admin full-name &quot;System Administrator&quot;
set system login user admin authentication plaintext-password &quot;SecurePassword123!&quot;

# Create operator user with limited access
set system login user operator full-name &quot;Network Operator&quot;
set system login user operator authentication plaintext-password &quot;OperatorPass456!&quot;
set system login user operator level operator

commit
&lt;/code&gt;&lt;/pre&gt;
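&lt;p&gt;On commit, VyOS hashes the plaintext password; only the hash is stored in the running config. You can confirm this right after committing:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# plaintext-password is converted to encrypted-password on commit
show configuration commands | grep encrypted-password

# set system login user admin authentication encrypted-password &apos;$6$...&apos;
&lt;/code&gt;&lt;/pre&gt;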
&lt;h3&gt;User Levels&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;Access&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;admin&lt;/td&gt;
&lt;td&gt;Full configuration access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;operator&lt;/td&gt;
&lt;td&gt;Show commands, limited operational commands&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;pre&gt;&lt;code&gt;# Admin level (default)
set system login user admin level admin

# Operator level (read-mostly)
set system login user operator level operator
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Operator Limitations&lt;/h3&gt;
&lt;p&gt;Operators can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;View configuration&lt;/li&gt;
&lt;li&gt;Run show commands&lt;/li&gt;
&lt;li&gt;Basic operational commands&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Operators cannot:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Enter configuration mode&lt;/li&gt;
&lt;li&gt;Modify settings&lt;/li&gt;
&lt;li&gt;Restart services&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;SSH Key Authentication&lt;/h2&gt;
&lt;h3&gt;Generate Keys (Client Side)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# On your workstation
ssh-keygen -t ed25519 -C &quot;admin@example.com&quot;
# Or RSA if ed25519 not supported
ssh-keygen -t rsa -b 4096 -C &quot;admin@example.com&quot;

# Get public key
cat ~/.ssh/id_ed25519.pub
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Add Key to VyOS&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Add SSH key for user
set system login user admin authentication public-keys admin@workstation key &quot;AAAAC3NzaC1lZDI1NTE5AAAAIBxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx&quot;
set system login user admin authentication public-keys admin@workstation type ssh-ed25519

# Or for RSA
set system login user admin authentication public-keys admin@workstation key &quot;AAAAB3NzaC1yc2EAAAADAQABAAACAQxxxxxxxxx&quot;
set system login user admin authentication public-keys admin@workstation type ssh-rsa

commit
&lt;/code&gt;&lt;/pre&gt;
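&lt;p&gt;A common stumble: the &lt;code&gt;key&lt;/code&gt; field takes only the base64 body of the public key, without the &lt;code&gt;ssh-ed25519&lt;/code&gt; prefix or the trailing comment. Extract just that column before pasting:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Print only the base64 key body for the VyOS &quot;key&quot; field
awk &apos;{print $2}&apos; ~/.ssh/id_ed25519.pub
&lt;/code&gt;&lt;/pre&gt;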
&lt;h3&gt;Multiple Keys Per User&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Work laptop
set system login user admin authentication public-keys work-laptop key &quot;...&quot;
set system login user admin authentication public-keys work-laptop type ssh-ed25519

# Home workstation
set system login user admin authentication public-keys home-desktop key &quot;...&quot;
set system login user admin authentication public-keys home-desktop type ssh-ed25519

# Emergency key (stored securely)
set system login user admin authentication public-keys emergency key &quot;...&quot;
set system login user admin authentication public-keys emergency type ssh-ed25519
&lt;/code&gt;&lt;/pre&gt;
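&lt;p&gt;Typing these commands by hand invites copy-paste errors in the key blob. A small helper can generate them from a standard &lt;code&gt;authorized_keys&lt;/code&gt;-style line; the function name and output format below are illustrative, not a VyOS tool:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Sketch: turn an OpenSSH public-key line into VyOS set commands.
# Assumes the key comment is usable as the public-keys identifier.
def pubkey_to_vyos(line, user):
    key_type, blob, comment = line.strip().split(None, 2)
    base = f&quot;set system login user {user} authentication public-keys {comment}&quot;
    return [f&apos;{base} key &quot;{blob}&quot;&apos;, f&apos;{base} type {key_type}&apos;]

for cmd in pubkey_to_vyos(&quot;ssh-ed25519 AAAATESTBLOB admin@workstation&quot;, &quot;admin&quot;):
    print(cmd)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Paste the printed lines into configure mode; the blob goes in without the &lt;code&gt;ssh-ed25519&lt;/code&gt; prefix, which is exactly what VyOS expects.&lt;/p&gt;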
&lt;h3&gt;Disable Password Authentication&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# After adding SSH keys, disable password login
set service ssh disable-password-authentication

commit

# Now only SSH key authentication works
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Remove Default User&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;configure

# First, ensure you can login with new user!
# Test SSH key login in another terminal before deleting vyos user

# Delete default user
delete system login user vyos

commit

# If you lock yourself out, you&apos;ll need console access
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;SSH Service Configuration&lt;/h2&gt;
&lt;h3&gt;Basic SSH Hardening&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Listen only on management interface
set service ssh listen-address 192.168.1.1

# Change port (optional, security through obscurity)
set service ssh port 22222

# Disable password authentication
set service ssh disable-password-authentication

# Set login timeout
set service ssh timeout 120

# Limit authentication attempts
set service ssh max-auth-retries 3

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Allowed Networks&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Use firewall to limit SSH source IPs
set firewall ipv4 name MGMT-LOCAL rule 10 action accept
set firewall ipv4 name MGMT-LOCAL rule 10 destination port 22
set firewall ipv4 name MGMT-LOCAL rule 10 protocol tcp
set firewall ipv4 name MGMT-LOCAL rule 10 source address 10.0.0.0/24
set firewall ipv4 name MGMT-LOCAL rule 10 description &quot;SSH from admin network only&quot;

set firewall ipv4 name MGMT-LOCAL rule 999 action drop
set firewall ipv4 name MGMT-LOCAL rule 999 description &quot;Drop all other&quot;
&lt;/code&gt;&lt;/pre&gt;
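&lt;p&gt;It&apos;s worth sanity-checking which sources the rule actually covers before you commit. Python&apos;s &lt;code&gt;ipaddress&lt;/code&gt; module mirrors the prefix match; 10.0.0.0/24 is the admin network from rule 10 above:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import ipaddress

ADMIN_NET = ipaddress.ip_network(&quot;10.0.0.0/24&quot;)  # source in rule 10

def ssh_allowed(src_ip):
    &quot;&quot;&quot;Would rule 10 match this source address?&quot;&quot;&quot;
    return ipaddress.ip_address(src_ip) in ADMIN_NET

print(ssh_allowed(&quot;10.0.0.42&quot;))  # True: inside the admin network
print(ssh_allowed(&quot;10.0.1.42&quot;))  # False: falls through to rule 999 and is dropped
&lt;/code&gt;&lt;/pre&gt;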
&lt;h3&gt;SSH Client Keepalive&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Keep connections alive
set service ssh client-keepalive-interval 60
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Per-User Restrictions&lt;/h2&gt;
&lt;h3&gt;Restrict User to Specific Source&lt;/h3&gt;
&lt;p&gt;VyOS can&apos;t restrict a login to a specific source address directly, but a firewall rule approximates it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create group for restricted user&apos;s source
set firewall group network-group OPERATOR-NETS network 192.168.10.0/24

# Firewall rule allowing operator SSH only from specific network
# Combined with per-user SSH keys for enforcement
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Login Tracking&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# View current sessions
show users

# View login history
show log | grep -i ssh
show log | grep -i login

# Last logins
last
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Emergency Access&lt;/h2&gt;
&lt;h3&gt;Console Access&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Serial console always works
# Configure serial port
set system console device ttyS0 speed 115200
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Emergency User&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Create break-glass account
set system login user emergency full-name &quot;Emergency Access&quot;
set system login user emergency authentication public-keys emergency key &quot;...&quot;
set system login user emergency authentication public-keys emergency type ssh-ed25519

# Store private key securely (safe, vault, etc.)
# Only use when normal access fails
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Password Policies&lt;/h2&gt;
&lt;p&gt;VyOS doesn&apos;t have built-in password policies, so enforce these practices by convention:&lt;/p&gt;
&lt;h3&gt;Strong Passwords&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# When setting passwords, enforce complexity
# Minimum 12 characters
# Mix of upper, lower, numbers, symbols

# Example (use a password manager to generate; single quotes stop the
# shell from expanding $ and &amp;amp;)
set system login user admin authentication plaintext-password &apos;K8#mP9$nL2@qR5&amp;amp;w&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Encrypted Password Storage&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# VyOS shows passwords encrypted in config
show configuration commands | grep authentication

# Output shows hash, not plaintext:
# set system login user admin authentication encrypted-password &apos;$6$rounds=xxx$...&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Regular Password Rotation&lt;/h3&gt;
&lt;p&gt;VyOS offers no automated rotation, so establish a manual process:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Document rotation schedule&lt;/li&gt;
&lt;li&gt;Use calendar reminders&lt;/li&gt;
&lt;li&gt;Change all passwords&lt;/li&gt;
&lt;li&gt;Update documentation&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Service Accounts&lt;/h2&gt;
&lt;p&gt;For automation (Ansible, scripts):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create service account
set system login user ansible full-name &quot;Ansible Automation&quot;
set system login user ansible authentication public-keys ansible-server key &quot;...&quot;
set system login user ansible authentication public-keys ansible-server type ssh-ed25519

# Admin level needed for configuration
set system login user ansible level admin

# Consider: dedicated key per automation tool
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Audit Trail&lt;/h2&gt;
&lt;h3&gt;Enable Logging&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# VyOS logs authentication events to syslog
show log | grep -i auth

# Send to remote syslog for retention
set system syslog host 10.0.0.100 facility auth level info
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;What Gets Logged&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;SSH login success/failure&lt;/li&gt;
&lt;li&gt;Configuration changes&lt;/li&gt;
&lt;li&gt;Privilege escalation&lt;/li&gt;
&lt;li&gt;User source IP&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Review Logs&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Recent auth events
show log | grep -i auth | tail -50

# Failed logins
show log | grep -i &quot;Failed password&quot;

# Configuration changes
show log | grep -i commit
&lt;/code&gt;&lt;/pre&gt;
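&lt;p&gt;When brute-force attempts show up, a quick tally per source IP tells you whether it&apos;s one scanner or many. A minimal sketch, assuming standard sshd syslog lines as shown by &lt;code&gt;show log&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import re
from collections import Counter

# Sample sshd log lines (format assumed from a typical setup)
LOG = &quot;&quot;&quot;\
Jan  8 14:02:11 router sshd[913]: Failed password for invalid user test from 203.0.113.9 port 55110 ssh2
Jan  8 14:02:15 router sshd[913]: Failed password for root from 203.0.113.9 port 55111 ssh2
Jan  8 14:03:01 router sshd[921]: Failed password for admin from 198.51.100.7 port 40022 ssh2
&quot;&quot;&quot;

# Count failed logins per source IP
fails = Counter(re.findall(r&quot;Failed password for .*? from (\S+) port&quot;, LOG))
print(fails.most_common())  # [(&apos;203.0.113.9&apos;, 2), (&apos;198.51.100.7&apos;, 1)]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run the same idea against exported syslog on your log server; anything with a high count is a candidate for a firewall block.&lt;/p&gt;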
&lt;h2&gt;Multi-User Setup Example&lt;/h2&gt;
&lt;h3&gt;Complete Setup&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Admin users (full access)
set system login user admin1 full-name &quot;Alice Admin&quot;
set system login user admin1 authentication public-keys laptop key &quot;...&quot;
set system login user admin1 authentication public-keys laptop type ssh-ed25519
set system login user admin1 level admin

set system login user admin2 full-name &quot;Bob Admin&quot;
set system login user admin2 authentication public-keys laptop key &quot;...&quot;
set system login user admin2 authentication public-keys laptop type ssh-ed25519
set system login user admin2 level admin

# Operator users (limited access)
set system login user noc1 full-name &quot;NOC Operator 1&quot;
set system login user noc1 authentication public-keys workstation key &quot;...&quot;
set system login user noc1 authentication public-keys workstation type ssh-ed25519
set system login user noc1 level operator

# Service account (for automation)
set system login user ansible full-name &quot;Ansible Service&quot;
set system login user ansible authentication public-keys server key &quot;...&quot;
set system login user ansible authentication public-keys server type ssh-ed25519
set system login user ansible level admin

# Remove default user
delete system login user vyos

# SSH hardening
set service ssh disable-password-authentication

commit
save
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Shared accounts are an audit nightmare. Individual accounts with SSH keys are the baseline.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Minimum requirements:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Individual accounts&lt;/strong&gt;: One user = one person&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SSH keys&lt;/strong&gt;: No password authentication&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Principle of least privilege&lt;/strong&gt;: Operators don&apos;t need admin&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Remove defaults&lt;/strong&gt;: Delete vyos user&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Log everything&lt;/strong&gt;: Remote syslog for audit&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;When the next security incident happens:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;With shared accounts: &quot;Someone changed something&quot;&lt;/li&gt;
&lt;li&gt;With individual accounts: &quot;admin1 changed firewall rule 50 at 14:32 from IP 10.0.0.55&quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The audit trail is the difference between &quot;we don&apos;t know&quot; and &quot;here&apos;s exactly what happened.&quot;&lt;/p&gt;
&lt;p&gt;Set up users properly from day one. Retrofitting access control during an incident is not fun.&lt;/p&gt;
</content:encoded><category>vyos</category><category>networking</category><category>security</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Upgrade Playbook: Safe Upgrades, Rollback, and Migration Testing</title><link>https://ashimov.com/posts/vyos-upgrade/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-upgrade/</guid><description>Master VyOS upgrades without downtime or disasters. Covers image management, rollback procedures, pre-upgrade testing, migration paths, and why upgrades need a playbook, not improvisation.</description><pubDate>Tue, 02 Dec 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&quot;Just upgrade to the latest version&quot; sounds simple. Then you discover your config syntax changed, a feature was deprecated, and your BGP peers are down. The router is 200 km away. It&apos;s Friday at 6 PM.&lt;/p&gt;
&lt;p&gt;VyOS image-based upgrades are actually quite safe — if you follow a process. The system keeps multiple images. Rollback is one reboot away. But you need to test before production.&lt;/p&gt;
&lt;p&gt;Upgrades need a playbook, not improvisation.&lt;/p&gt;
&lt;h2&gt;VyOS Image System&lt;/h2&gt;
&lt;p&gt;VyOS runs from images. Multiple images can coexist:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# List installed images
show system image

# Output:
# The system currently has the following image(s) installed:
#   1: 1.4.0 (default boot)
#   2: 1.3.5 (running)
#   3: 1.3.4
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Key concepts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Running image&lt;/strong&gt;: Currently booted&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Default boot&lt;/strong&gt;: Will boot on next restart&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multiple images&lt;/strong&gt;: Can have several installed&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Pre-Upgrade Checklist&lt;/h2&gt;
&lt;h3&gt;1. Backup Everything&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Full config backup
show configuration commands &amp;gt; /config/backup-pre-upgrade.txt

# Save to local file
save /config/config.boot.backup-$(date +%Y%m%d)

# Copy offsite
scp /config/config.boot admin@backup-server:/backups/

# Also back up custom scripts and user data, if present
tar -czf /tmp/config-backup.tar.gz /config/scripts/ /config/user-data/
scp /tmp/config-backup.tar.gz admin@backup-server:/backups/
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Document Current State&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Version info
show version

# Running config
show configuration

# Interface status
show interfaces

# Routing state
show ip route
show ip bgp summary
show ip ospf neighbor

# Save all this output!
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Check Release Notes&lt;/h3&gt;
&lt;p&gt;Before upgrading, read:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Release notes for target version&lt;/li&gt;
&lt;li&gt;Migration notes&lt;/li&gt;
&lt;li&gt;Known issues&lt;/li&gt;
&lt;li&gt;Deprecated features&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# VyOS release notes
# https://docs.vyos.io/en/latest/changelog/

# Check what changed between versions
# Pay attention to:
# - Breaking changes
# - Syntax changes
# - Feature deprecations
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4. Verify Disk Space&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check space for new image
df -h

# Images typically need 1-2 GB
# If low on space, remove old images
delete system image 1.3.3
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Download and Add Image&lt;/h2&gt;
&lt;h3&gt;From Release Server&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Add image from URL
add system image https://github.com/vyos/vyos-rolling-nightly-builds/releases/download/1.4-rolling-YYYYMMDD/vyos-1.4-rolling-YYYYMMDD-amd64.iso

# Or from local file
add system image /tmp/vyos-1.4.0-amd64.iso
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Verify Download&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# VyOS verifies image signature automatically
# Watch for verification messages during add

# After adding:
show system image
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Upgrade Process&lt;/h2&gt;
&lt;h3&gt;Step 1: Add New Image&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Download and add
add system image https://path/to/vyos-1.4.0-amd64.iso

# Follow prompts:
# - Confirm image signature
# - Keep or overwrite config
# - Set as default boot
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 2: Set Default Boot&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# If not set during add:
set system image default-boot 1.4.0

# Verify
show system image
# Should show 1.4.0 as default boot
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 3: Reboot with Confirm Plan&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Before rebooting, ensure you have:
# - Console access (in case new image fails)
# - Rollback plan documented
# - Maintenance window scheduled

# Reboot
reboot

# Or schedule for off-hours
reboot at 02:00
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 4: Verify After Boot&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check version
show version

# Verify config loaded
show configuration

# Check critical services
show interfaces
show ip bgp summary
show ip ospf neighbor

# Test connectivity
ping 8.8.8.8
&lt;/code&gt;&lt;/pre&gt;
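&lt;p&gt;Comparing the saved pre-upgrade output against the post-upgrade state catches interfaces that didn&apos;t come back. A rough sketch, assuming you captured &lt;code&gt;show interfaces&lt;/code&gt; output to plain text on both sides:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def up_interfaces(show_output):
    &quot;&quot;&quot;Names of interfaces reported up/up (u/u) in `show interfaces` text.&quot;&quot;&quot;
    return {line.split()[0] for line in show_output.splitlines()
            if line and not line.startswith(&quot; &quot;) and &quot;u/u&quot; in line}

pre  = &quot;eth0  192.0.2.1/30  u/u\neth1  10.0.0.1/24  u/u\n&quot;
post = &quot;eth0  192.0.2.1/30  u/u\neth1  -             A/D\n&quot;

missing = up_interfaces(pre) - up_interfaces(post)
print(missing)  # {&apos;eth1&apos;}: investigate before signing off the upgrade
&lt;/code&gt;&lt;/pre&gt;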
&lt;h2&gt;Rollback Procedures&lt;/h2&gt;
&lt;h3&gt;If New Image Fails to Boot&lt;/h3&gt;
&lt;p&gt;At GRUB menu:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Select previous image from boot menu&lt;/li&gt;
&lt;li&gt;System boots with old image&lt;/li&gt;
&lt;li&gt;Config is preserved&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;After Booting Bad Image&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Set old image as default
set system image default-boot 1.3.5

# Reboot to old image
reboot

# After reboot, optionally delete bad image
delete system image 1.4.0
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;During Upgrade Issues&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# If config migration fails
# System usually keeps backup

# Check for backup configs
ls /config/

# Restore backup
cp /config/config.boot.backup-20250108 /config/config.boot
reboot
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Testing New Images&lt;/h2&gt;
&lt;h3&gt;Test in Lab First&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Create test VM
# Same hardware profile as production

# Install current production version
# Apply production config (sanitized)
# Upgrade to new version
# Test everything

# Lab testing catches:
# - Config migration issues
# - Feature deprecations
# - Breaking changes
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Staged Rollout&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Day 1: Upgrade lab
Day 2-3: Test all features in lab
Day 4: Upgrade least critical production router
Day 5-7: Monitor
Day 8: Upgrade remaining routers
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Test Checklist&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;## Upgrade Test Checklist

### Pre-Upgrade
- [ ] Config backup completed
- [ ] Release notes reviewed
- [ ] Lab test passed
- [ ] Maintenance window scheduled
- [ ] Console access verified
- [ ] Team notified

### Post-Upgrade Verification
- [ ] System booted successfully
- [ ] Correct version running
- [ ] All interfaces up
- [ ] BGP sessions established
- [ ] OSPF neighbors formed
- [ ] VPN tunnels up
- [ ] NAT working
- [ ] Firewall rules active
- [ ] DNS resolution working
- [ ] Monitoring connected

### Sign-off
- [ ] All tests passed
- [ ] No errors in logs
- [ ] Performance normal
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Version-Specific Migration&lt;/h2&gt;
&lt;h3&gt;1.3.x to 1.4.x Migration&lt;/h3&gt;
&lt;p&gt;Major syntax changes in VyOS 1.4:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Firewall syntax changed significantly

# 1.3.x style:
set firewall name WAN-IN rule 10 action accept

# 1.4.x style:
set firewall ipv4 name WAN-IN rule 10 action accept
# Note the &apos;ipv4&apos; addition

# VyOS migrates automatically, but verify
show firewall
&lt;/code&gt;&lt;/pre&gt;
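&lt;p&gt;You can preview the rename on a saved command backup before touching the router. This only previews the firewall naming change; the real migration is performed by VyOS&apos;s own scripts during the upgrade:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import re

def preview_14_firewall(cmd):
    &quot;&quot;&quot;Rewrite a 1.3-style `set firewall name` command to the 1.4 ipv4 form.&quot;&quot;&quot;
    return re.sub(r&quot;^set firewall name &quot;, &quot;set firewall ipv4 name &quot;, cmd)

old = &quot;set firewall name WAN-IN rule 10 action accept&quot;
print(preview_14_firewall(old))
# set firewall ipv4 name WAN-IN rule 10 action accept
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Running this over &lt;code&gt;backup-commands.txt&lt;/code&gt; shows roughly how many lines the migration will touch, which is useful for estimating review effort.&lt;/p&gt;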
&lt;h3&gt;Check Migration Logs&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# After upgrade, check for warnings
show log | grep -i migrat
show log | grep -i deprecat
show log | grep -i error

# Migration script output
cat /var/log/vyos-migrate.log
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Downgrade Considerations&lt;/h2&gt;
&lt;h3&gt;Can You Downgrade?&lt;/h3&gt;
&lt;p&gt;Generally yes, but:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Config format may have changed&lt;/li&gt;
&lt;li&gt;New features won&apos;t exist in old version&lt;/li&gt;
&lt;li&gt;May need manual config adjustment&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# Downgrade process:
# 1. Keep old image installed
set system image default-boot 1.3.5

# 2. Reboot
reboot

# 3. Old config should load
# 4. Check for issues
show configuration
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Config Compatibility&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Before upgrade, save config in both formats

# As commands (works across versions)
show configuration commands &amp;gt; /config/backup-commands.txt

# As JSON (useful for parsing)
show configuration json &amp;gt; /config/backup-json.txt
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Automation&lt;/h2&gt;
&lt;h3&gt;Ansible Upgrade Playbook&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;---
- name: VyOS Upgrade Playbook
  hosts: vyos_routers
  gather_facts: no

  vars:
    new_image_url: &quot;https://...&quot;
    backup_dir: &quot;/tmp/vyos-backups&quot;

  tasks:
    - name: Backup configuration
      vyos.vyos.vyos_command:
        commands:
          - show configuration commands
      register: config_backup

    - name: Save backup locally
      local_action:
        module: copy
        content: &quot;{{ config_backup.stdout[0] }}&quot;
        dest: &quot;{{ backup_dir }}/{{ inventory_hostname }}-pre-upgrade.txt&quot;

    - name: Download new image
      vyos.vyos.vyos_command:
        commands:
          - &quot;add system image {{ new_image_url }}&quot;
      register: add_result

    - name: Set new image as default
      vyos.vyos.vyos_command:
        commands:
          # default-boot needs an explicit image name or it prompts interactively
          - &quot;set system image default-boot 1.4.0&quot;

    - name: Reboot (async)
      vyos.vyos.vyos_command:
        commands:
          - reboot now
      async: 1
      poll: 0

    - name: Wait for reboot
      wait_for:
        host: &quot;{{ ansible_host }}&quot;
        port: 22
        delay: 60
        timeout: 300

    - name: Verify new version
      vyos.vyos.vyos_command:
        commands:
          - show version
      register: version_check

    - name: Display version
      debug:
        var: version_check.stdout_lines
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Emergency Recovery&lt;/h2&gt;
&lt;h3&gt;Boot from ISO&lt;/h3&gt;
&lt;p&gt;If all images fail:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Boot from VyOS ISO (USB/CD)&lt;/li&gt;
&lt;li&gt;Select &quot;Live&quot; option&lt;/li&gt;
&lt;li&gt;Mount existing config:&lt;pre&gt;&lt;code&gt;mount /dev/sda1 /mnt
cp /mnt/config/config.boot /config/
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;Install fresh image:&lt;pre&gt;&lt;code&gt;install image
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Serial Console Recovery&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Connect via serial console
# Speed: 115200 8N1

# At GRUB, can select any installed image
# Even if network is misconfigured
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Upgrades need a playbook, not improvisation.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The playbook:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Backup&lt;/strong&gt; everything before touching anything&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Test&lt;/strong&gt; in lab with production config&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Read&lt;/strong&gt; release notes for breaking changes&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stage&lt;/strong&gt; rollout across multiple days&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Verify&lt;/strong&gt; everything after upgrade&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rollback&lt;/strong&gt; plan ready before you start&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;What makes VyOS upgrades safe:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Multiple images coexist&lt;/li&gt;
&lt;li&gt;Rollback is one reboot&lt;/li&gt;
&lt;li&gt;Config usually migrates automatically&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What makes upgrades dangerous:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;No backup&lt;/li&gt;
&lt;li&gt;No testing&lt;/li&gt;
&lt;li&gt;No rollback plan&lt;/li&gt;
&lt;li&gt;Friday at 5 PM&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Treat every upgrade as potentially breaking. Have the rollback plan ready. Test first. Then it&apos;s boring and safe — exactly how maintenance should be.&lt;/p&gt;
</content:encoded><category>vyos</category><category>networking</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Configuration Standards: Naming, Comments, Structure That Scales</title><link>https://ashimov.com/posts/vyos-config-standards/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-config-standards/</guid><description>Build maintainable VyOS configurations with consistent naming, strategic comments, firewall groups, and policy structure. Learn standards that make configs readable years later.</description><pubDate>Fri, 28 Nov 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Look at a router config written three years ago. Can you understand what each rule does? Who added it? Why? If the answer is &quot;no,&quot; you have a maintenance problem.&lt;/p&gt;
&lt;p&gt;Configuration standards aren&apos;t bureaucracy. They&apos;re the difference between &quot;I can fix this in 5 minutes&quot; and &quot;I need to reverse-engineer 500 rules to understand what might break.&quot;&lt;/p&gt;
&lt;p&gt;Good config is self-documenting. Bad config is a puzzle box that only its creator could solve — and they left the company.&lt;/p&gt;
&lt;h2&gt;Naming Conventions&lt;/h2&gt;
&lt;h3&gt;Interface Descriptions&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Bad: No description
set interfaces ethernet eth0

# Bad: Useless description
set interfaces ethernet eth0 description &quot;interface 0&quot;

# Good: Purpose and destination
set interfaces ethernet eth0 description &quot;WAN: ISP-Acme-1Gbps-Circuit-12345&quot;
set interfaces ethernet eth1 description &quot;LAN: Server-VLAN-10.0.0.0/24&quot;
set interfaces ethernet eth2 description &quot;MGMT: OOB-Management-172.16.0.0/24&quot;

# Pattern: &amp;lt;ZONE&amp;gt;: &amp;lt;Purpose&amp;gt;-&amp;lt;Details&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;VLAN Naming&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Include VLAN ID in description for clarity
set interfaces ethernet eth0 vif 100 description &quot;VLAN100: Production-Servers&quot;
set interfaces ethernet eth0 vif 200 description &quot;VLAN200: Development&quot;
set interfaces ethernet eth0 vif 999 description &quot;VLAN999: Quarantine-Untrusted&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Firewall Names&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Pattern: &amp;lt;FROM-ZONE&amp;gt;-&amp;lt;TO-ZONE&amp;gt; or &amp;lt;INTERFACE&amp;gt;-&amp;lt;DIRECTION&amp;gt;

# Zone-based naming
set firewall ipv4 name WAN-TO-LAN ...
set firewall ipv4 name LAN-TO-WAN ...
set firewall ipv4 name DMZ-TO-LAN ...

# Interface-based naming
set firewall ipv4 name ETH0-IN ...
set firewall ipv4 name ETH0-OUT ...
set firewall ipv4 name ETH0-LOCAL ...

# Pick one pattern, use it consistently
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Firewall Rule Numbering&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Reserve ranges for different purposes:
# 1-99:     Critical infrastructure rules
# 100-199:  Management access
# 200-499:  Application rules
# 500-899:  User rules
# 900-998:  Logging/monitoring rules
# 999:      Default deny (explicit)

set firewall ipv4 name WAN-IN rule 10 description &quot;Allow established&quot;
set firewall ipv4 name WAN-IN rule 100 description &quot;Management: SSH from admin nets&quot;
set firewall ipv4 name WAN-IN rule 200 description &quot;App: Web servers HTTP/HTTPS&quot;
set firewall ipv4 name WAN-IN rule 999 description &quot;Default: Deny and log&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;BGP Peer Naming&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Include AS number and purpose
set protocols bgp neighbor 10.0.0.1 description &quot;AS64512: Transit-ISP-Primary&quot;
set protocols bgp neighbor 10.0.0.2 description &quot;AS64513: Transit-ISP-Backup&quot;
set protocols bgp neighbor 192.168.1.1 description &quot;AS65001: Customer-Acme-Corp&quot;
set protocols bgp neighbor 172.16.1.1 description &quot;AS65000: iBGP-RR-Client-DC2&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Comment Strategy&lt;/h2&gt;
&lt;h3&gt;What to Comment&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Comment WHY, not WHAT
# The config shows what. Comments explain why.

# Bad: States the obvious
set firewall ipv4 name WAN-IN rule 100 description &quot;Allow TCP 22&quot;

# Good: Explains the reason
set firewall ipv4 name WAN-IN rule 100 description &quot;SSH: Admin access per SEC-2025-001&quot;

# Good: References ticket/change
set firewall ipv4 name WAN-IN rule 150 description &quot;Temp: Vendor access until 2025-03-01 - TKT-4521&quot;

# Good: Explains non-obvious choice
set firewall ipv4 name WAN-IN rule 200 description &quot;HTTP: Redirect only, actual traffic via reverse proxy&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Temporary Rules&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Always mark temporary rules with expiration
set firewall ipv4 name WAN-IN rule 500 description &quot;TEMP: Contractor VPN until 2025-02-28 - Remove after project X&quot;

# Create reminder
# Add to monitoring/ticketing system
# Set calendar reminder

# Pattern: TEMP: &amp;lt;purpose&amp;gt; until &amp;lt;date&amp;gt; - &amp;lt;context&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
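&lt;p&gt;The &lt;code&gt;TEMP: ... until &amp;lt;date&amp;gt;&lt;/code&gt; pattern becomes enforceable once something scans for it. A sketch that flags expired entries in a command backup, assuming the YYYY-MM-DD format used above:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import re
from datetime import date

def expired_temp_rules(config_text, today):
    &quot;&quot;&quot;Flag TEMP descriptions whose &apos;until YYYY-MM-DD&apos; date has passed.&quot;&quot;&quot;
    hits = []
    for m in re.finditer(r&apos;description &quot;TEMP: .*?until (\d{4})-(\d{2})-(\d{2})&apos;, config_text):
        y, mo, d = map(int, m.groups())
        if date(y, mo, d) &amp;lt; today:
            hits.append(m.group(0))
    return hits

cfg = &apos;set firewall ipv4 name WAN-IN rule 500 description &quot;TEMP: Contractor VPN until 2025-02-28 - project X&quot;&apos;
print(expired_temp_rules(cfg, date(2025, 3, 15)))  # rule 500 is past its date
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Wire this into a nightly cron on the backup server and expired rules page you instead of lingering for years.&lt;/p&gt;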
&lt;h3&gt;Configuration Sections&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Use comments to mark logical sections

# === MANAGEMENT ACCESS ===
set firewall ipv4 name WAN-IN rule 100 ...
set firewall ipv4 name WAN-IN rule 101 ...

# === APPLICATION TRAFFIC ===
set firewall ipv4 name WAN-IN rule 200 ...

# VyOS can attach comments to nodes (&apos;comment firewall ipv4 name WAN-IN
# rule 99 &quot;text&quot;&apos;), but there are no free-standing section comments.
# A rule at a boundary number can serve as a marker instead:

set firewall ipv4 name WAN-IN rule 99 action accept
set firewall ipv4 name WAN-IN rule 99 description &quot;=== MANAGEMENT SECTION ===&quot;
set firewall ipv4 name WAN-IN rule 99 state established
# Duplicates the usual established-accept rule, so it changes nothing
# in practice; its description just marks where the section begins
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Firewall Groups&lt;/h2&gt;
&lt;p&gt;Groups are aliases that make rules readable and maintainable.&lt;/p&gt;
&lt;h3&gt;Network Groups&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Define once, use everywhere
set firewall group network-group ADMIN-NETS network 10.0.0.0/24
set firewall group network-group ADMIN-NETS network 192.168.100.0/24
set firewall group network-group ADMIN-NETS description &quot;Admin workstation networks&quot;

set firewall group network-group RFC1918 network 10.0.0.0/8
set firewall group network-group RFC1918 network 172.16.0.0/12
set firewall group network-group RFC1918 network 192.168.0.0/16
set firewall group network-group RFC1918 description &quot;Private address space&quot;

# Use in rules
set firewall ipv4 name WAN-IN rule 100 source group network-group ADMIN-NETS
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Port Groups&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;set firewall group port-group WEB-PORTS port 80
set firewall group port-group WEB-PORTS port 443
set firewall group port-group WEB-PORTS description &quot;HTTP and HTTPS&quot;

set firewall group port-group MAIL-PORTS port 25
set firewall group port-group MAIL-PORTS port 465
set firewall group port-group MAIL-PORTS port 587
set firewall group port-group MAIL-PORTS port 993
set firewall group port-group MAIL-PORTS description &quot;Mail server ports&quot;

# Use in rules
set firewall ipv4 name WAN-IN rule 200 destination group port-group WEB-PORTS
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Address Groups&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# For individual IPs
set firewall group address-group DNS-SERVERS address 8.8.8.8
set firewall group address-group DNS-SERVERS address 8.8.4.4
set firewall group address-group DNS-SERVERS address 1.1.1.1
set firewall group address-group DNS-SERVERS description &quot;Public DNS resolvers&quot;

set firewall group address-group NTP-SERVERS address 129.6.15.28
set firewall group address-group NTP-SERVERS address 129.6.15.29
set firewall group address-group NTP-SERVERS description &quot;NIST NTP servers&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Group Maintenance&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# When a network changes, update the group - all rules update automatically

# Old approach (nightmare):
# Search every rule for 10.0.0.0/24, update each one

# Group approach:
show firewall group network-group ADMIN-NETS
# Update one place
set firewall group network-group ADMIN-NETS network 10.0.1.0/24
delete firewall group network-group ADMIN-NETS network 10.0.0.0/24
commit
# All rules using ADMIN-NETS now use new network
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Policy Structure&lt;/h2&gt;
&lt;h3&gt;Route Maps&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Consistent naming: &amp;lt;PURPOSE&amp;gt;-&amp;lt;DIRECTION&amp;gt;-&amp;lt;PEER-TYPE&amp;gt;

# For BGP
set policy route-map TRANSIT-IN-FILTER rule 10 ...
set policy route-map TRANSIT-OUT-FILTER rule 10 ...
set policy route-map CUSTOMER-IN-FILTER rule 10 ...
set policy route-map PEER-IN-FILTER rule 10 ...

# For redistribution
set policy route-map OSPF-TO-BGP rule 10 ...
set policy route-map BGP-TO-OSPF rule 10 ...
set policy route-map CONNECTED-TO-OSPF rule 10 ...
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Prefix Lists&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Name clearly
set policy prefix-list BOGONS rule 10 action deny
set policy prefix-list BOGONS rule 10 prefix 0.0.0.0/8
set policy prefix-list BOGONS rule 10 le 32
set policy prefix-list BOGONS description &quot;Invalid source addresses&quot;

set policy prefix-list OUR-PREFIXES rule 10 action permit
set policy prefix-list OUR-PREFIXES rule 10 prefix 203.0.113.0/24
set policy prefix-list OUR-PREFIXES description &quot;Our announced address space&quot;

set policy prefix-list DEFAULT-ONLY rule 10 action permit
set policy prefix-list DEFAULT-ONLY rule 10 prefix 0.0.0.0/0
set policy prefix-list DEFAULT-ONLY description &quot;Match only default route&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;AS Path Lists&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# For BGP filtering
set policy as-path-list CUSTOMER-AS rule 10 action permit
set policy as-path-list CUSTOMER-AS rule 10 regex &quot;^65001$&quot;
set policy as-path-list CUSTOMER-AS description &quot;Customer AS65001 only&quot;

set policy as-path-list NO-TRANSIT rule 10 action permit
set policy as-path-list NO-TRANSIT rule 10 regex &quot;.*65000.*&quot;
set policy as-path-list NO-TRANSIT description &quot;Block routes through AS65000&quot;
# Match here; apply the deny in the route-map that references NO-TRANSIT
&lt;/code&gt;&lt;/pre&gt;
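&lt;p&gt;AS-path regexes match against the space-separated AS_PATH string. For patterns this simple, Python&apos;s &lt;code&gt;re&lt;/code&gt; behaves the same as FRR&apos;s POSIX engine (an assumption that holds here, not for every regex), which makes them easy to test offline:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import re

# AS_PATH as a string of space-separated AS numbers
paths = [&quot;65001&quot;, &quot;65000 65001&quot;, &quot;64512 65000 64999&quot;]

customer_only = [p for p in paths if re.search(r&quot;^65001$&quot;, p)]
via_65000 = [p for p in paths if re.search(r&quot;.*65000.*&quot;, p)]

print(customer_only)  # [&apos;65001&apos;]: routes originated by the customer alone
print(via_65000)      # [&apos;65000 65001&apos;, &apos;64512 65000 64999&apos;]
# Caveat: a bare substring also matches AS 165000; FRR&apos;s &apos;_&apos; anchor avoids that
&lt;/code&gt;&lt;/pre&gt;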
&lt;h2&gt;Configuration Templates&lt;/h2&gt;
&lt;h3&gt;Standard Router Sections&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Order your config consistently:

# 1. System settings
set system host-name router01
set system domain-name example.com
set system time-zone UTC

# 2. Users and authentication
set system login user admin ...

# 3. Interfaces
set interfaces ethernet eth0 ...

# 4. Firewall groups (define before using)
set firewall group ...

# 5. Firewall rules
set firewall ipv4 name ...

# 6. NAT
set nat source ...

# 7. Routing protocols
set protocols static ...
set protocols bgp ...

# 8. Services (DHCP, DNS, etc)
set service dhcp-server ...

# 9. VPN
set vpn ...

# 10. Zone policy (1.4 syntax; &apos;set zone-policy&apos; in 1.3)
set firewall zone ...
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Change Documentation Template&lt;/h3&gt;
&lt;p&gt;When making changes, document:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## Change: &amp;lt;Brief description&amp;gt;
Date: 2025-01-08
Ticket: TKT-12345
Author: admin

### Purpose
Why this change is needed.

### Changes
- Added firewall rule 150 for new application
- Updated ADMIN-NETS group with new subnet

### Testing
- Verified connectivity from admin network
- Confirmed application accessible

### Rollback
Commands to undo:
delete firewall ipv4 name WAN-IN rule 150
delete firewall group network-group ADMIN-NETS network 10.0.2.0/24
commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Avoiding Common Mistakes&lt;/h2&gt;
&lt;h3&gt;Mistake 1: Magic Numbers&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Bad: What is 10.5.32.15?
set firewall ipv4 name WAN-IN rule 100 source address 10.5.32.15

# Good: Use groups with descriptions
set firewall group address-group MONITORING-SERVERS address 10.5.32.15
set firewall group address-group MONITORING-SERVERS description &quot;Prometheus server&quot;
set firewall ipv4 name WAN-IN rule 100 source group address-group MONITORING-SERVERS
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Mistake 2: No Rule Descriptions&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Bad: What does this do?
set firewall ipv4 name WAN-IN rule 47 action accept
set firewall ipv4 name WAN-IN rule 47 destination port 8443

# Good: Self-documenting
set firewall ipv4 name WAN-IN rule 200 action accept
set firewall ipv4 name WAN-IN rule 200 destination port 8443
set firewall ipv4 name WAN-IN rule 200 description &quot;App: Customer portal HTTPS&quot;
set firewall ipv4 name WAN-IN rule 200 protocol tcp
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Mistake 3: Inconsistent Numbering&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Bad: Random rule numbers
set firewall ipv4 name WAN-IN rule 5 ...
set firewall ipv4 name WAN-IN rule 23 ...
set firewall ipv4 name WAN-IN rule 7 ...
set firewall ipv4 name WAN-IN rule 156 ...

# Good: Deliberate ranges
# 1-99: Infrastructure
# 100-199: Management
# 200-899: Applications
# 900-999: Cleanup/deny
&lt;/code&gt;&lt;/pre&gt;
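&lt;p&gt;Once you adopt ranges like these, a small lint script can enforce them before anything reaches the router. A minimal sketch in Python — the range map, the rule-line format, and the &lt;code&gt;lint&lt;/code&gt; helper are assumptions matching the convention above, not a VyOS feature:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/usr/bin/env python3
# Hypothetical linter: flag firewall rule numbers outside the agreed ranges.
import re

RANGES = {
    &quot;infrastructure&quot;: range(1, 100),
    &quot;management&quot;: range(100, 200),
    &quot;application&quot;: range(200, 900),
    &quot;cleanup&quot;: range(900, 1000),
}

RULE_RE = re.compile(r&quot;set firewall ipv4 name \S+ rule (\d+)&quot;)

def classify(rule_number):
    &quot;&quot;&quot;Return the category a rule number belongs to, or None.&quot;&quot;&quot;
    for category, rng in RANGES.items():
        if rule_number in rng:
            return category
    return None

def lint(config_lines):
    &quot;&quot;&quot;Yield rule numbers that fall outside every agreed range.&quot;&quot;&quot;
    for line in config_lines:
        m = RULE_RE.match(line)
        if m and classify(int(m.group(1))) is None:
            yield int(m.group(1))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run it over &lt;code&gt;show configuration commands&lt;/code&gt; output in review; an out-of-range rule number fails the check before it becomes a mystery.&lt;/p&gt;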
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Good config is self-documenting. Bad config is a puzzle box.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Standards to adopt today:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Name things by purpose&lt;/strong&gt;: &lt;code&gt;WAN-TO-LAN&lt;/code&gt;, not &lt;code&gt;FIREWALL1&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use groups&lt;/strong&gt;: Define once, maintain once&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Number consistently&lt;/strong&gt;: Ranges for different rule types&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Describe everything&lt;/strong&gt;: Future you will thank present you&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reference tickets&lt;/strong&gt;: Link to change management&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The time spent on naming and comments pays back 100x when troubleshooting at 3 AM.&lt;/p&gt;
&lt;p&gt;A config you can read is a config you can fix. A config you can&apos;t read is a liability waiting to become an incident.&lt;/p&gt;
&lt;p&gt;Write configs for the next person. The next person might be you, six months from now, with no memory of why rule 47 exists.&lt;/p&gt;
</content:encoded><category>vyos</category><category>networking</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Configuration Sessions: Parallel Work Without Conflicts</title><link>https://ashimov.com/posts/vyos-config-sessions/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-config-sessions/</guid><description>Master VyOS configuration sessions for team environments. Covers session isolation, concurrent editing, merge strategies, and why sessions prevent &quot;who changed what&quot; mysteries.</description><pubDate>Tue, 25 Nov 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Two engineers SSH into the same router. Both type &lt;code&gt;configure&lt;/code&gt;. Both make changes. Both commit. One overwrites the other&apos;s work. Nobody knows what happened until something breaks.&lt;/p&gt;
&lt;p&gt;VyOS configuration sessions solve this. Each session is isolated. Changes don&apos;t interfere. Conflicts are detected before commit, not after outage.&lt;/p&gt;
&lt;p&gt;Sessions prevent &quot;who changed what&quot; mysteries before they become incidents.&lt;/p&gt;
&lt;h2&gt;The Problem&lt;/h2&gt;
&lt;p&gt;Without sessions, configuration mode is a shared space:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Engineer A                    Engineer B
-----------                   -----------
configure                     configure
set firewall rule 10...      set interfaces eth1...
                              commit  ← B&apos;s changes saved
commit  ← A overwrites B&apos;s changes!
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Engineer B&apos;s changes are silently lost. No warning, no merge, just gone.&lt;/p&gt;
&lt;h2&gt;How Sessions Work&lt;/h2&gt;
&lt;p&gt;VyOS 1.4+ supports configuration sessions. Each session:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Has a unique ID&lt;/li&gt;
&lt;li&gt;Works on an isolated copy of the config&lt;/li&gt;
&lt;li&gt;Can see other active sessions&lt;/li&gt;
&lt;li&gt;Detects conflicts at commit&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;Engineer A (session abc123)    Engineer B (session def456)
--------------------------     --------------------------
configure                      configure
# Working on isolated copy     # Working on isolated copy
set firewall...                set interfaces...
# Changes only in abc123       # Changes only in def456
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Session Management&lt;/h2&gt;
&lt;h3&gt;View Active Sessions&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# List all configuration sessions
show configuration sessions

# Example output:
# Session ID          User        Started
# abc123             admin        2025-01-08 10:15:03
# def456             engineer     2025-01-08 10:18:22
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Enter Named Session&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Create or resume named session
configure session my-firewall-changes

# Session persists even if you disconnect
# SSH back in, resume:
configure session my-firewall-changes

# All your uncommitted changes are still there
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Discard Session&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Exit and discard changes
exit discard

# Or explicitly delete session
delete configuration session my-firewall-changes
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Conflict Detection&lt;/h2&gt;
&lt;h3&gt;What Happens on Conflict&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Engineer A in session abc123:
configure
set interfaces ethernet eth0 description &quot;WAN Link&quot;
commit  # Success

# Engineer B in session def456:
configure
set interfaces ethernet eth0 description &quot;Internet Uplink&quot;
commit  # ERROR: Configuration conflict detected

# Output:
# The following configuration conflicts were detected:
# interfaces ethernet eth0 description
#   Current: &quot;WAN Link&quot;
#   Your change: &quot;Internet Uplink&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Resolving Conflicts&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Option 1: Refresh and redo
exit discard
configure
# See current config
show interfaces ethernet eth0
# Make decision, apply change
set interfaces ethernet eth0 description &quot;Internet Uplink&quot;
commit

# Option 2: Force your version
commit --force  # Overwrites conflicting values
# Use with caution - you&apos;re overwriting someone&apos;s work

# Option 3: Merge manually
compare  # See differences
# Adjust your changes to accommodate both
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Session Workflow Examples&lt;/h2&gt;
&lt;h3&gt;Example 1: Large Firewall Update&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Start named session for tracking
configure session firewall-2025-01-08

# Make many changes over time
set firewall ipv4 name WAN-IN rule 100 ...
set firewall ipv4 name WAN-IN rule 101 ...
set firewall ipv4 name WAN-IN rule 102 ...

# Save session, exit for lunch
exit

# Come back, continue work
configure session firewall-2025-01-08
set firewall ipv4 name WAN-IN rule 103 ...

# Review all changes
show | compare

# Commit when ready
commit-confirm 5
confirm

# Session automatically cleaned up after commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Example 2: Team Coordination&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Engineer A: Working on BGP
configure session bgp-changes
set protocols bgp neighbor 10.0.0.1 ...

# Engineer B: Working on firewall (different session)
configure session firewall-changes
set firewall ipv4 name ...

# Both can work simultaneously
# Both can commit (no conflicts - different config sections)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Example 3: Testing Before Commit&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Create test session
configure session testing-nat

# Make changes
set nat source rule 100 ...

# Compare what would change
show | compare

# Show the commands that would be applied
show | commands

# Decide to discard
exit discard
# Or decide to apply
commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Viewing Session Differences&lt;/h2&gt;
&lt;h3&gt;Compare Session to Running Config&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure session my-changes

# What would change if I commit?
show | compare

# Output shows:
# +set interfaces ethernet eth0 description &quot;New description&quot;
# -set interfaces ethernet eth0 description &quot;Old description&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Compare Sessions to Each Other&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Not directly supported, but you can:

# From session A, save diff
show | compare &amp;gt; /tmp/session-a-diff.txt

# From session B, save diff
show | compare &amp;gt; /tmp/session-b-diff.txt

# Compare files
diff /tmp/session-a-diff.txt /tmp/session-b-diff.txt
&lt;/code&gt;&lt;/pre&gt;
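&lt;p&gt;If you want something friendlier than raw &lt;code&gt;diff&lt;/code&gt; output, the same comparison can be scripted with Python&apos;s standard &lt;code&gt;difflib&lt;/code&gt; — a small sketch, with the session labels purely illustrative:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/usr/bin/env python3
# Compare two saved &apos;show | compare&apos; captures as a unified diff.
import difflib

def compare_session_diffs(text_a, text_b):
    &quot;&quot;&quot;Return unified-diff lines between two session diff captures.&quot;&quot;&quot;
    return list(difflib.unified_diff(
        text_a.splitlines(),
        text_b.splitlines(),
        fromfile=&quot;session-a&quot;,
        tofile=&quot;session-b&quot;,
        lineterm=&quot;&quot;,
    ))
&lt;/code&gt;&lt;/pre&gt;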
&lt;h2&gt;Session Timeout&lt;/h2&gt;
&lt;p&gt;Sessions don&apos;t persist forever:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Default: sessions expire after 30 minutes of inactivity

# Check current timeout
show configuration session-timeout

# Modify timeout (in minutes)
configure
set system session-timeout 60
commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Persistent Sessions&lt;/h3&gt;
&lt;p&gt;For long-running work:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Keep session alive with periodic activity
# Or increase timeout significantly

# Check session status
show configuration sessions
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Best Practices&lt;/h2&gt;
&lt;h3&gt;1. Use Named Sessions for Major Changes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Bad: Anonymous session
configure  # What was I working on?

# Good: Descriptive name
configure session ticket-12345-add-bgp-peer
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. One Logical Change Per Session&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Bad: Everything in one session
configure session all-my-stuff
set firewall ...
set bgp ...
set nat ...
# Huge commit, hard to rollback one thing

# Good: Separate sessions
configure session firewall-update
# ... only firewall changes

configure session bgp-peer-addition
# ... only BGP changes
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Review Before Commit&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure session important-change

# Always review what you&apos;re committing
show | compare
show | commands

# Then commit
commit-confirm 5
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4. Clean Up Abandoned Sessions&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# List old sessions
show configuration sessions

# Delete abandoned ones
delete configuration session old-test-session
delete configuration session johns-forgotten-work
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Limitations&lt;/h2&gt;
&lt;h3&gt;What Sessions Don&apos;t Do&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Don&apos;t provide history&lt;/strong&gt; - Sessions are temporary workspaces, not audit trails&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Don&apos;t allow collaborative editing&lt;/strong&gt; - One user per session&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Don&apos;t auto-merge&lt;/strong&gt; - Conflicts must be resolved manually&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Don&apos;t persist across reboots&lt;/strong&gt; - Uncommitted sessions are lost&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;When Sessions Aren&apos;t Enough&lt;/h3&gt;
&lt;p&gt;For complex team workflows, consider:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Git-based configuration management&lt;/li&gt;
&lt;li&gt;Ansible/Terraform for deployments&lt;/li&gt;
&lt;li&gt;Separate staging environment&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Sessions handle concurrent edits. Version control handles history and rollback.&lt;/p&gt;
&lt;h2&gt;Troubleshooting&lt;/h2&gt;
&lt;h3&gt;Can&apos;t Enter Configuration Mode&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Another session might be locked
show configuration sessions

# If stuck session exists
# Try to take ownership (if you&apos;re sure)
configure session stuck-session
exit discard
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Lost Session Changes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Session timed out or router rebooted?
# Uncommitted changes are gone

# Always commit before:
# - Long breaks
# - Router maintenance
# - End of work day
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Conflict on Every Commit&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Another process might be modifying config

# Check for automation
show log | grep -i commit

# Coordinate with team
# Pause automated deployments during manual work
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Sessions prevent &quot;who changed what&quot; mysteries before they become incidents.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;On a shared router:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Without sessions: Chaos, overwrites, mystery changes&lt;/li&gt;
&lt;li&gt;With sessions: Isolation, conflict detection, clear ownership&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The few seconds to type &lt;code&gt;configure session descriptive-name&lt;/code&gt; saves hours of debugging &quot;why did this config disappear?&quot;&lt;/p&gt;
&lt;p&gt;Every team environment should use sessions. Every major change should use a named session. Every commit should be preceded by a review.&lt;/p&gt;
&lt;p&gt;The router knows when you&apos;re about to overwrite someone&apos;s work. Let it tell you.&lt;/p&gt;
</content:encoded><category>vyos</category><category>networking</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Commit-Confirm: Your Safety Net Against Self-Lockout</title><link>https://ashimov.com/posts/vyos-commit-confirm/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-commit-confirm/</guid><description>Master VyOS commit-confirm to prevent remote lockouts. Covers automatic rollback, confirmation workflow, timeout tuning, and why every remote change should use confirm.</description><pubDate>Fri, 21 Nov 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;You SSH into a remote router. Change firewall rules. Commit. Connection drops. You just locked yourself out. The router is 500 km away. Your Friday evening plans are cancelled.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;commit-confirm&lt;/code&gt; exists precisely for this scenario. It commits changes temporarily. If you don&apos;t confirm within a timeout, changes roll back automatically. The router saves itself from your mistakes.&lt;/p&gt;
&lt;p&gt;Every remote change should use &lt;code&gt;commit-confirm&lt;/code&gt;. No exceptions.&lt;/p&gt;
&lt;h2&gt;How Commit-Confirm Works&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;1. You run: commit-confirm 5
2. Changes are applied
3. Timer starts (5 minutes)
4. If you run &apos;confirm&apos; before timeout → changes persist
5. If timeout expires → automatic rollback to previous config
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The key insight: if your change breaks connectivity, you can&apos;t confirm. So the router reverts itself.&lt;/p&gt;
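&lt;p&gt;The mechanism is easy to model. A toy sketch in Python — this illustrates the state machine, not how VyOS actually implements it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/usr/bin/env python3
# Toy model of commit-confirm: apply, then confirm or auto-revert.

class Router:
    def __init__(self, config):
        self.running = config   # active config
        self.saved = config     # last confirmed config
        self.pending = False

    def commit_confirm(self, new_config):
        &quot;&quot;&quot;Apply new config; it persists only if confirm() follows.&quot;&quot;&quot;
        self.saved = self.running
        self.running = new_config
        self.pending = True

    def confirm(self):
        &quot;&quot;&quot;Make the pending config permanent.&quot;&quot;&quot;
        self.saved = self.running
        self.pending = False

    def timeout(self):
        &quot;&quot;&quot;Timer expired without confirm: revert to the saved config.&quot;&quot;&quot;
        if self.pending:
            self.running = self.saved
            self.pending = False
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If your change cuts your own access, you never call &lt;code&gt;confirm()&lt;/code&gt;, the timer fires, and &lt;code&gt;running&lt;/code&gt; is the old config again.&lt;/p&gt;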
&lt;h2&gt;Basic Usage&lt;/h2&gt;
&lt;h3&gt;Standard Commit-Confirm&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Enter configuration mode
configure

# Make your changes
set firewall ipv4 name WAN-IN rule 100 action drop
set firewall ipv4 name WAN-IN rule 100 source address 10.0.0.0/8

# Commit with 5-minute timeout
commit-confirm 5

# Test connectivity, verify everything works
# Then confirm:
confirm

# Or if something is wrong, just wait - it will rollback
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Different Timeout Values&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Quick change, confident it works (minimum 1 minute)
commit-confirm 1

# Complex change, need time to verify
commit-confirm 10

# Major change, need extensive testing
commit-confirm 30
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Choose timeout based on how long you need to verify the change works.&lt;/p&gt;
&lt;h3&gt;Check Remaining Time&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# See if there&apos;s a pending confirm
show system commit

# Output shows:
# Commit confirmed in X minutes
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Real-World Scenarios&lt;/h2&gt;
&lt;h3&gt;Scenario 1: Firewall Rule Change&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Adding rule to allow new service
set firewall ipv4 name WAN-IN rule 50 action accept
set firewall ipv4 name WAN-IN rule 50 destination port 8443
set firewall ipv4 name WAN-IN rule 50 protocol tcp

# Use confirm because this touches firewall
commit-confirm 3

# Test from external client
# curl https://server:8443 - works!

# Confirm the change
confirm
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Scenario 2: Interface Address Change&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Changing management IP - very risky remotely
set interfaces ethernet eth0 address 192.168.1.100/24
delete interfaces ethernet eth0 address 192.168.1.50/24

# Critical: use confirm with enough time to reconnect
commit-confirm 5

# Immediately try to SSH to new IP
# ssh admin@192.168.1.100

# If you can connect:
confirm

# If you can&apos;t connect - wait 5 minutes, router reverts
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Scenario 3: Routing Change&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Changing default gateway
delete protocols static route 0.0.0.0/0
set protocols static route 0.0.0.0/0 next-hop 10.0.0.1

commit-confirm 3

# Verify connectivity
ping 8.8.8.8
traceroute 8.8.8.8

# Looks good
confirm
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;What Happens on Rollback&lt;/h2&gt;
&lt;p&gt;When timeout expires without confirmation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# 1. Current session is terminated
# 2. Config reverts to pre-commit state
# 3. All services reload with old config
# 4. Log entry created

# Check what happened
show log | grep -i rollback
show log | grep -i commit

# View current config (should be pre-change)
show configuration
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Commit-Confirm vs Regular Commit&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;&lt;code&gt;commit&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;&lt;code&gt;commit-confirm&lt;/code&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Immediate&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Persistent&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Only after confirm&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rollback&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;Automatic on timeout&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Remote safety&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Use case&lt;/td&gt;
&lt;td&gt;Local changes&lt;/td&gt;
&lt;td&gt;Remote changes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;Best Practices&lt;/h2&gt;
&lt;h3&gt;1. Always Use for Remote Sessions&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Script wrapper for remote changes
#!/bin/bash
# safe-commit.sh

TIMEOUT=${1:-5}  # Default 5 minutes

source /opt/vyatta/etc/functions/script-template
configure
# ... your changes ...
commit-confirm $TIMEOUT

echo &quot;Changes applied. Confirm with &apos;confirm&apos; within $TIMEOUT minutes&quot;
echo &quot;Or changes will automatically rollback&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Test Connectivity Immediately&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;commit-confirm 3

# Immediately verify you can still reach the router
# From another terminal:
ping router-ip
ssh admin@router-ip

# Only then confirm
confirm
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Have Console Access Ready&lt;/h3&gt;
&lt;p&gt;For major changes, have out-of-band access ready:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;IPMI/iLO console&lt;/li&gt;
&lt;li&gt;Serial console&lt;/li&gt;
&lt;li&gt;Local access&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If commit-confirm somehow doesn&apos;t save you, console access will.&lt;/p&gt;
&lt;h3&gt;4. Document the Rollback Happened&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# After a rollback, document it
show log | grep -i rollback

# Add comment explaining what was attempted
configure
comment firewall ipv4 name WAN-IN &quot;Note: rule 100 attempt on 2025-01-08 caused lockout, rolled back&quot;
commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Edge Cases&lt;/h2&gt;
&lt;h3&gt;Confirm from Different Session&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Session 1: commit-confirm 5
# Session 1: loses connectivity

# Session 2: SSH into router (if possible via different path)
confirm  # This confirms Session 1&apos;s changes

# Useful when you have multiple network paths
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Immediate Rollback&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Changed your mind before timeout?
configure
rollback 0  # Rollback to last saved config
commit

# Or simply exit without confirming
exit discard
# Wait for timeout
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Power Failure During Confirm Window&lt;/h3&gt;
&lt;p&gt;If router reboots during commit-confirm window:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Uncommitted changes are lost&lt;/li&gt;
&lt;li&gt;Router boots with last saved (pre-commit-confirm) config&lt;/li&gt;
&lt;li&gt;This is the safe behavior&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Automating Confirm&lt;/h2&gt;
&lt;p&gt;For automated deployments, you need programmatic confirm:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Ansible approach
- name: Apply VyOS config with confirm
  vyos.vyos.vyos_config:
    lines:
      - set firewall ipv4 name TEST rule 10 action accept
    save: no

- name: Verify connectivity
  wait_for:
    host: &quot;{{ inventory_hostname }}&quot;
    port: 22
    timeout: 60

- name: Confirm changes
  vyos.vyos.vyos_command:
    commands:
      - confirm

# If verify fails, Ansible doesn&apos;t reach confirm
# Timeout expires, config rolls back
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;API-Based Confirm&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Using VyOS API
curl -X POST https://router/configure \
  -H &quot;Content-Type: application/json&quot; \
  -d &apos;{
    &quot;op&quot;: &quot;set&quot;,
    &quot;path&quot;: [&quot;firewall&quot;, &quot;ipv4&quot;, &quot;name&quot;, &quot;TEST&quot;],
    &quot;value&quot;: &quot;...&quot;
  }&apos;

# Apply with timeout
curl -X POST https://router/config-file \
  -d &apos;{&quot;op&quot;: &quot;commit-confirm&quot;, &quot;minutes&quot;: 5}&apos;

# Verify connectivity, then:
curl -X POST https://router/config-file \
  -d &apos;{&quot;op&quot;: &quot;confirm&quot;}&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Common Mistakes&lt;/h2&gt;
&lt;h3&gt;Mistake 1: Timeout Too Short&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Bad: 1 minute for complex change
commit-confirm 1
# You&apos;re still verifying when it rolls back

# Better: Give yourself time
commit-confirm 10
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Mistake 2: Forgetting to Confirm&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;commit-confirm 5
# Test, looks good
# Get distracted
# 5 minutes pass
# Changes gone!

# Tip: Set a timer on your phone
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Mistake 3: Not Using It At All&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# The classic mistake
configure
set firewall ...  # Typo somewhere
commit  # No safety net
# Locked out
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Every remote change should use &lt;code&gt;commit-confirm&lt;/code&gt;. No exceptions.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The cost of using &lt;code&gt;commit-confirm&lt;/code&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A few seconds to type the command&lt;/li&gt;
&lt;li&gt;Remembering to confirm&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The cost of not using it:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Emergency drive to datacenter&lt;/li&gt;
&lt;li&gt;Out-of-band console access scramble&lt;/li&gt;
&lt;li&gt;Ruined weekend&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The router can save itself from your mistakes. Let it.&lt;/p&gt;
&lt;p&gt;I&apos;ve been saved by &lt;code&gt;commit-confirm&lt;/code&gt; more times than I care to admit. Every time I think &quot;this change is simple, I don&apos;t need confirm&quot; — that&apos;s exactly when I need it most.&lt;/p&gt;
&lt;p&gt;Use it. Always.&lt;/p&gt;
</content:encoded><category>vyos</category><category>networking</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Automation &amp; GitOps for VyOS: Templates, Backups, Safe Deploy</title><link>https://ashimov.com/posts/vyos-automation/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-automation/</guid><description>Practical VyOS automation with Git, templates, and safe deployment practices. Covers config backup strategies, Jinja2 templates, Ansible integration, rollback procedures, and why automation reduces errors only if you have rules of the game.</description><pubDate>Tue, 18 Nov 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Every network incident postmortem I&apos;ve read includes some variation of &quot;a configuration change was made.&quot; Manual changes on production routers are the leading cause of outages. We know this. We still do it.&lt;/p&gt;
&lt;p&gt;Automation isn&apos;t about being fancy. It&apos;s about reducing the blast radius of human error. When configs live in Git, changes are reviewed before deployment, and rollback is one command away — you still make mistakes, but they&apos;re smaller and recoverable.&lt;/p&gt;
&lt;p&gt;This is how to automate VyOS configuration management in a way that actually works.&lt;/p&gt;
&lt;h2&gt;The Problem with Manual Configuration&lt;/h2&gt;
&lt;p&gt;Picture this:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Need to add a firewall rule&lt;/li&gt;
&lt;li&gt;SSH into router&lt;/li&gt;
&lt;li&gt;Type commands from memory&lt;/li&gt;
&lt;li&gt;Typo in IP address&lt;/li&gt;
&lt;li&gt;Commit&lt;/li&gt;
&lt;li&gt;Traffic drops&lt;/li&gt;
&lt;li&gt;Panic&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Now picture:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Edit rule in Git&lt;/li&gt;
&lt;li&gt;PR reviewed by colleague (catches typo)&lt;/li&gt;
&lt;li&gt;Merge triggers automated deploy&lt;/li&gt;
&lt;li&gt;Change applied&lt;/li&gt;
&lt;li&gt;If wrong, &lt;code&gt;git revert&lt;/code&gt; and redeploy&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Same change. One is an incident, one is Tuesday.&lt;/p&gt;
&lt;h2&gt;Config Backup Strategy&lt;/h2&gt;
&lt;p&gt;Before automating changes, automate backups. You need to recover from whatever you&apos;re about to break.&lt;/p&gt;
&lt;h3&gt;Manual Backup Commands&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Full config as set commands
show configuration commands &amp;gt; /config/backup-$(date +%Y%m%d).txt

# Config as JSON (useful for parsing)
show configuration json &amp;gt; /config/backup-$(date +%Y%m%d).json
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Automated Backup Script&lt;/h3&gt;
&lt;p&gt;Create &lt;code&gt;/config/scripts/backup-config.sh&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash

BACKUP_DIR=&quot;/config/backups&quot;
DATE=$(date +%Y%m%d-%H%M%S)
HOSTNAME=$(hostname)
BACKUP_FILE=&quot;${BACKUP_DIR}/${HOSTNAME}-${DATE}.cfg&quot;

# Create backup directory
mkdir -p &quot;${BACKUP_DIR}&quot;

# Export config
/opt/vyatta/sbin/vyatta-cfg-cmd-wrapper begin
/opt/vyatta/bin/cli-shell-api showCfg --show-active-only &amp;gt; &quot;${BACKUP_FILE}&quot;
/opt/vyatta/sbin/vyatta-cfg-cmd-wrapper end

# Compress
gzip &quot;${BACKUP_FILE}&quot;

# Keep last 30 days
find &quot;${BACKUP_DIR}&quot; -name &quot;*.cfg.gz&quot; -mtime +30 -delete

# Optional: Push to remote storage
# scp &quot;${BACKUP_FILE}.gz&quot; backup-server:/backups/
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Schedule via cron:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure
set system task-scheduler task backup-config cron-spec &apos;0 * * * *&apos;
set system task-scheduler task backup-config executable path &apos;/config/scripts/backup-config.sh&apos;
commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Hourly backups, 30 days retention.&lt;/p&gt;
&lt;h3&gt;Off-Router Backup&lt;/h3&gt;
&lt;p&gt;Backups on the router die with the router. Push to external storage:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# /config/scripts/backup-remote.sh

HOSTNAME=$(hostname)
DATE=$(date +%Y%m%d)
REMOTE=&quot;git@git.example.com:network/configs.git&quot;
WORK_DIR=&quot;/tmp/config-backup&quot;

# Clone repo
rm -rf &quot;${WORK_DIR}&quot;
git clone &quot;${REMOTE}&quot; &quot;${WORK_DIR}&quot;

# Export config
/opt/vyatta/bin/cli-shell-api showCfg --show-active-only &amp;gt; &quot;${WORK_DIR}/${HOSTNAME}.cfg&quot;

# Commit and push
cd &quot;${WORK_DIR}&quot;
git add &quot;${HOSTNAME}.cfg&quot;
git commit -m &quot;Automated backup: ${HOSTNAME} ${DATE}&quot; || true
git push

# Cleanup
rm -rf &quot;${WORK_DIR}&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now every config change is version-controlled, even manual ones.&lt;/p&gt;
&lt;h2&gt;Configuration as Code&lt;/h2&gt;
&lt;p&gt;Store your configs in Git from the start, not just as backups.&lt;/p&gt;
&lt;h3&gt;Repository Structure&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;vyos-configs/
├── README.md
├── inventory/
│   ├── production.yml
│   └── staging.yml
├── templates/
│   ├── base/
│   │   ├── system.j2
│   │   ├── interfaces.j2
│   │   └── firewall.j2
│   └── roles/
│       ├── edge-router.j2
│       └── core-router.j2
├── vars/
│   ├── common.yml
│   └── per-router/
│       ├── router1.yml
│       └── router2.yml
├── configs/
│   ├── router1.cfg
│   └── router2.cfg
└── scripts/
    ├── generate.py
    ├── deploy.sh
    └── validate.sh
&lt;/code&gt;&lt;/pre&gt;
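&lt;p&gt;The &lt;code&gt;validate.sh&lt;/code&gt; step is where generated configs get sanity-checked before deploy. A minimal sketch of such a check in Python — the specific checks here are assumptions; adapt them to your templates:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/usr/bin/env python3
# Hypothetical pre-deploy check for generated VyOS configs.

ALLOWED_PREFIXES = (&quot;set &quot;, &quot;delete &quot;, &quot;comment &quot;, &quot;#&quot;)

def validate(config_text):
    &quot;&quot;&quot;Return (line_number, line) pairs that look wrong.&quot;&quot;&quot;
    problems = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped:
            continue  # blank lines are fine
        if &quot;{{&quot; in stripped or &quot;}}&quot; in stripped:
            problems.append((number, line))  # unrendered Jinja2 marker
        elif not stripped.startswith(ALLOWED_PREFIXES):
            problems.append((number, line))  # not a config command
    return problems
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;CI runs this over every generated file; a non-empty result fails the pipeline before anything touches a router.&lt;/p&gt;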
&lt;h2&gt;Jinja2 Templates&lt;/h2&gt;
&lt;p&gt;Templates let you define a config pattern once and instantiate it for each router.&lt;/p&gt;
&lt;h3&gt;Template Example&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;templates/base/interfaces.j2&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{# Interface configuration template #}

{% for iface in interfaces %}
set interfaces ethernet {{ iface.name }} address &apos;{{ iface.address }}&apos;
set interfaces ethernet {{ iface.name }} description &apos;{{ iface.description }}&apos;
{% if iface.vrrp is defined %}
set high-availability vrrp group {{ iface.vrrp.group }} interface &apos;{{ iface.name }}&apos;
set high-availability vrrp group {{ iface.vrrp.group }} virtual-address &apos;{{ iface.vrrp.vip }}&apos;
set high-availability vrrp group {{ iface.vrrp.group }} priority &apos;{{ iface.vrrp.priority }}&apos;
{% endif %}
{% endfor %}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Variables File&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;vars/per-router/router1.yml&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;hostname: router1
router_id: 10.255.255.1

interfaces:
  - name: eth0
    address: 10.0.0.2/24
    description: LAN
    vrrp:
      group: LAN
      vip: 10.0.0.1/24
      priority: 200
  - name: eth1
    address: 203.0.113.2/24
    description: WAN
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Generation Script&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;scripts/generate.py&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/usr/bin/env python3
import yaml
import jinja2
import sys
from pathlib import Path

def generate_config(router_name):
    # Load variables
    common = yaml.safe_load(open(&apos;vars/common.yml&apos;))
    router = yaml.safe_load(open(f&apos;vars/per-router/{router_name}.yml&apos;))

    # Merge variables
    variables = {**common, **router}

    # Load templates
    env = jinja2.Environment(
        loader=jinja2.FileSystemLoader(&apos;templates&apos;),
        undefined=jinja2.StrictUndefined
    )

    # Render each template
    output = []
    for template_file in sorted(Path(&apos;templates/base&apos;).glob(&apos;*.j2&apos;)):
        template = env.get_template(f&apos;base/{template_file.name}&apos;)
        output.append(template.render(**variables))

    return &apos;\n&apos;.join(output)

if __name__ == &apos;__main__&apos;:
    router = sys.argv[1]
    config = generate_config(router)
    print(config)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Generate config:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;python scripts/generate.py router1 &amp;gt; configs/router1.cfg
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Ansible Integration&lt;/h2&gt;
&lt;p&gt;Ansible is the de facto standard for network automation, and VyOS ships an official collection, &lt;code&gt;vyos.vyos&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;Inventory&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;inventory/production.yml&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;all:
  children:
    vyos_routers:
      hosts:
        router1:
          ansible_host: 10.0.0.2
        router2:
          ansible_host: 10.0.0.3
      vars:
        ansible_user: vyos
        ansible_network_os: vyos.vyos.vyos
        ansible_connection: ansible.netcommon.network_cli
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Playbook: Apply Configuration&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;playbooks/apply-config.yml&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;---
- name: Apply VyOS configuration
  hosts: vyos_routers
  gather_facts: no

  tasks:
    - name: Load configuration from file
      set_fact:
        config_lines: &quot;{{ lookup(&apos;file&apos;, &apos;configs/&apos; + inventory_hostname + &apos;.cfg&apos;).split(&apos;\n&apos;) }}&quot;

    - name: Apply configuration
      vyos.vyos.vyos_config:
        lines: &quot;{{ config_lines }}&quot;
        save: yes
      register: result

    - name: Show changes
      debug:
        var: result.commands
      when: result.changed
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ansible-playbook -i inventory/production.yml playbooks/apply-config.yml
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Playbook: Backup Before Change&lt;/h3&gt;
&lt;p&gt;Always backup before deploying:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;---
- name: Safe configuration deployment
  hosts: vyos_routers
  gather_facts: no

  tasks:
    - name: Backup current configuration
      vyos.vyos.vyos_config:
        backup: yes
        backup_options:
          filename: &quot;{{ inventory_hostname }}-{{ &apos;%Y-%m-%dT%H%M%S&apos; | strftime }}.cfg&quot;
          dir_path: ./backups/

    - name: Apply new configuration
      vyos.vyos.vyos_config:
        src: &quot;configs/{{ inventory_hostname }}.cfg&quot;
        save: yes
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Safe Deployment Practices&lt;/h2&gt;
&lt;p&gt;Automation without safety is just faster mistakes.&lt;/p&gt;
&lt;h3&gt;1. Dry Run First&lt;/h3&gt;
&lt;p&gt;VyOS doesn&apos;t have a true dry-run, but you can compare:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# scripts/diff-config.sh

ROUTER=$1
NEW_CONFIG=$2

# Get current config
ssh vyos@${ROUTER} &apos;show configuration commands&apos; &amp;gt; /tmp/current.cfg

# Compare
diff -u /tmp/current.cfg &quot;${NEW_CONFIG}&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Review the diff before deploying.&lt;/p&gt;
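&lt;p&gt;One caveat: &lt;code&gt;diff -u&lt;/code&gt; is line-order sensitive, and the ordering of &lt;code&gt;show configuration commands&lt;/code&gt; output may not match your generated file. A sketch of an order-insensitive comparison (a hypothetical helper, not a VyOS tool):&lt;/p&gt;

```python
# order-insensitive comparison of two "set ..." command dumps
# (hypothetical helper, not part of VyOS or the scripts above)
def config_delta(current_lines, new_lines):
    current = {line.strip() for line in current_lines if line.strip()}
    new = {line.strip() for line in new_lines if line.strip()}
    # commands to add on the router, commands to delete from it
    return sorted(new - current), sorted(current - new)

to_add, to_remove = config_delta(
    ["set system host-name 'router1'", "set service ssh port '22'"],
    ["set system host-name 'router1'", "set service ssh port '2222'"],
)
print(to_add)     # ["set service ssh port '2222'"]
print(to_remove)  # ["set service ssh port '22'"]
```

&lt;p&gt;Feed it the two files from the shell script above and you get a change list instead of a noisy positional diff.&lt;/p&gt;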
&lt;h3&gt;2. Staged Rollout&lt;/h3&gt;
&lt;p&gt;Don&apos;t deploy to all routers at once:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Deploy to staging first
- hosts: staging_routers
  tasks:
    - include_tasks: apply-config.yml

# Wait and validate
- hosts: staging_routers
  tasks:
    - name: Wait for convergence
      pause:
        minutes: 5

    - name: Validate connectivity
      vyos.vyos.vyos_command:
        commands:
          - ping 8.8.8.8 count 3
      register: ping_result
      failed_when: &quot;&apos;0 received&apos; in ping_result.stdout[0]&quot;

# Only then production
- hosts: production_routers
  tasks:
    - include_tasks: apply-config.yml
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Rollback Procedure&lt;/h3&gt;
&lt;p&gt;When things go wrong (they will), rollback fast:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# scripts/rollback.sh

ROUTER=$1
BACKUP_FILE=$2

echo &quot;Rolling back ${ROUTER} to ${BACKUP_FILE}&quot;

# Load backup config
ssh vyos@${ROUTER} &quot;configure; load ${BACKUP_FILE}; commit; save; exit&quot;

echo &quot;Rollback complete&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or with Ansible:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;- name: Emergency rollback
  hosts: &quot;{{ target_router }}&quot;
  gather_facts: no

  tasks:
    - name: Load backup configuration
      vyos.vyos.vyos_config:
        src: &quot;backups/{{ inventory_hostname }}-{{ backup_date }}.cfg&quot;
        save: yes
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4. Change Windows&lt;/h3&gt;
&lt;p&gt;Automate deployment timing, not just deployment:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Only deploy during change window
- name: Check change window
  hosts: localhost
  tasks:
    - name: Verify time is within change window
      assert:
        that:
          - ansible_date_time.weekday in [&apos;Saturday&apos;, &apos;Sunday&apos;]
          - ansible_date_time.hour | int &amp;gt;= 2
          - ansible_date_time.hour | int &amp;lt;= 6
        fail_msg: &quot;Outside change window (Sat-Sun 02:00-06:00)&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;5. Validation After Deploy&lt;/h3&gt;
&lt;p&gt;Don&apos;t just deploy and hope:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;- name: Post-deployment validation
  hosts: vyos_routers
  tasks:
    - name: Check BGP sessions
      vyos.vyos.vyos_command:
        commands:
          - show ip bgp summary
      register: bgp_status

    - name: Verify BGP established
      assert:
        that:
          - &quot;&apos;Established&apos; in bgp_status.stdout[0]&quot;
        fail_msg: &quot;BGP session not established!&quot;

    - name: Check VRRP status
      vyos.vyos.vyos_command:
        commands:
          - show vrrp
      register: vrrp_status

    - name: Check route count
      vyos.vyos.vyos_command:
        commands:
          - show ip route summary
      register: route_count
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;GitOps Workflow&lt;/h2&gt;
&lt;p&gt;Full GitOps: Git is the source of truth. Changes go through Git, not directly to routers.&lt;/p&gt;
&lt;h3&gt;Workflow&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;1. Engineer creates branch
2. Edits config in vars/ or templates/
3. Runs generate.py locally
4. Commits generated config
5. Opens PR
6. Colleague reviews diff
7. CI validates (syntax, linting)
8. PR merged
9. CD pipeline deploys to routers
10. Monitoring confirms success
&lt;/code&gt;&lt;/pre&gt;
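&lt;p&gt;The validation in step 7 can start very small. A sketch of a lint pass (my own illustration, not a standard tool) that rejects any line that isn&apos;t a &lt;code&gt;set&lt;/code&gt; command or a comment:&lt;/p&gt;

```python
# minimal lint for generated VyOS configs (illustration only):
# every non-blank, non-comment line must be a "set ..." command
def lint_config(text):
    bad = []
    for lineno, line in enumerate(text.splitlines(), 1):
        stripped = line.strip()
        if stripped and not stripped.startswith("#") and not stripped.startswith("set "):
            bad.append((lineno, stripped))
    return bad  # list of (line number, offending line)

sample = "set system host-name 'r1'\n# comment\nste typo here\n"
print(lint_config(sample))  # [(3, 'ste typo here')]
```

&lt;p&gt;Run it in CI before the drift check; a typo like &lt;code&gt;ste&lt;/code&gt; fails the PR instead of the router.&lt;/p&gt;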
&lt;h3&gt;CI Pipeline (GitHub Actions Example)&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;.github/workflows/validate.yml&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;name: Validate Config

on: [pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: &apos;3.11&apos;

      - name: Install dependencies
        run: pip install jinja2 pyyaml

      - name: Generate configs
        run: |
          for router in vars/per-router/*.yml; do
            name=$(basename &quot;$router&quot; .yml)
            python scripts/generate.py &quot;$name&quot; &amp;gt; &quot;configs/$name.cfg&quot;
          done

      - name: Check for config drift
        run: |
          git diff --exit-code configs/
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;CD Pipeline&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;.github/workflows/deploy.yml&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;name: Deploy Config

on:
  push:
    branches: [main]
    paths:
      - &apos;configs/**&apos;

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Ansible
        run: |
          pip install ansible
          ansible-galaxy collection install vyos.vyos

      - name: Deploy to staging
        run: |
          ansible-playbook -i inventory/staging.yml playbooks/apply-config.yml

      - name: Validate staging
        run: |
          ansible-playbook -i inventory/staging.yml playbooks/validate.yml

      - name: Deploy to production
        run: |
          ansible-playbook -i inventory/production.yml playbooks/apply-config.yml
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Automation reduces manual errors only if you set ground rules.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Automation without process is just automated mistakes. The value comes from:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Version control&lt;/strong&gt;: Every change tracked, reviewable, revertible&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Code review&lt;/strong&gt;: Someone else catches your typos&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Testing&lt;/strong&gt;: Validate before production&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Staged rollout&lt;/strong&gt;: Break staging, not production&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fast rollback&lt;/strong&gt;: Recover in minutes, not hours&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The router config should never be edited directly. Changes flow through Git. If it&apos;s not in Git, it didn&apos;t happen (or it shouldn&apos;t have).&lt;/p&gt;
&lt;p&gt;Start small. Automate backups first — that&apos;s pure upside. Then move to templated configs. Then add Ansible deployment. Then CI/CD. Each step reduces risk and increases confidence.&lt;/p&gt;
&lt;p&gt;The goal isn&apos;t to eliminate human involvement. It&apos;s to move humans from &quot;typing commands at 2 AM&quot; to &quot;reviewing diffs in daylight.&quot; That&apos;s where we make fewer mistakes.&lt;/p&gt;
</content:encoded><category>vyos</category><category>automation</category><category>networking</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>High Availability: VRRP + State Sync (What You Can and Can&apos;t Do)</title><link>https://ashimov.com/posts/vyos-ha/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-ha/</guid><description>Honest guide to VyOS high availability using VRRP and conntrack sync. Covers failover configuration, state synchronization, what actually fails over and what doesn&apos;t, testing procedures, and why HA is a set of failure scenarios, not a checkbox.</description><pubDate>Fri, 14 Nov 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;High availability sounds simple: two routers, one fails, the other takes over. Users don&apos;t notice. Uptime maintained. Check the HA box and move on.&lt;/p&gt;
&lt;p&gt;Reality is messier. VRRP can fail over an IP address in seconds. But what about active connections? NAT state? BGP sessions? Firewall sessions? Some of this can be synchronized. Some can&apos;t. Some can, but with caveats that matter.&lt;/p&gt;
&lt;p&gt;This is an honest guide to VyOS HA. What works, what doesn&apos;t, and how to test so you find out before production does.&lt;/p&gt;
&lt;h2&gt;VRRP Basics&lt;/h2&gt;
&lt;p&gt;VRRP (Virtual Router Redundancy Protocol) provides a virtual IP (VIP) shared between two or more routers. One is master, others are backup. If the master fails, a backup takes the VIP.&lt;/p&gt;
&lt;p&gt;Clients point to the VIP. They don&apos;t care which physical router is answering.&lt;/p&gt;
&lt;h3&gt;Basic VRRP Configuration&lt;/h3&gt;
&lt;p&gt;Two VyOS routers: R1 (primary) and R2 (backup).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;R1 (Primary):&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

set interfaces ethernet eth0 address &apos;10.0.0.2/24&apos;
set interfaces ethernet eth0 description &apos;LAN&apos;

set high-availability vrrp group LAN vrid &apos;10&apos;
set high-availability vrrp group LAN interface &apos;eth0&apos;
set high-availability vrrp group LAN virtual-address &apos;10.0.0.1/24&apos;
set high-availability vrrp group LAN priority &apos;200&apos;
set high-availability vrrp group LAN preempt

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;R2 (Backup):&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

set interfaces ethernet eth0 address &apos;10.0.0.3/24&apos;
set interfaces ethernet eth0 description &apos;LAN&apos;

set high-availability vrrp group LAN vrid &apos;10&apos;
set high-availability vrrp group LAN interface &apos;eth0&apos;
set high-availability vrrp group LAN virtual-address &apos;10.0.0.1/24&apos;
set high-availability vrrp group LAN priority &apos;100&apos;
set high-availability vrrp group LAN preempt

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Key settings:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;vrid&lt;/strong&gt;: Virtual Router ID. Must match on both routers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;virtual-address&lt;/strong&gt;: The shared IP clients use (10.0.0.1).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;priority&lt;/strong&gt;: Higher wins. R1 at 200 beats R2 at 100.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;preempt&lt;/strong&gt;: If R1 recovers, it reclaims master status.&lt;/li&gt;
&lt;/ul&gt;
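&lt;p&gt;The election logic behind these settings is simple enough to sketch: the highest priority wins, and per the VRRP RFCs a priority tie breaks on the highest primary IP address. A toy model (my sketch, not keepalived code):&lt;/p&gt;

```python
# toy model of VRRP master election (sketch of the RFC 3768 rules):
# highest priority wins; ties break on the highest primary IP address
from ipaddress import ip_address

def elect_master(routers):
    # routers: iterable of (name, priority, primary_ip) tuples
    return max(routers, key=lambda r: (r[1], ip_address(r[2])))[0]

print(elect_master([("R1", 200, "10.0.0.2"), ("R2", 100, "10.0.0.3")]))  # R1
print(elect_master([("R1", 100, "10.0.0.2"), ("R2", 100, "10.0.0.3")]))  # R2 (tie on priority)
```

&lt;p&gt;This is why equal priorities are a configuration smell: the tie-break is deterministic but rarely what anyone intended.&lt;/p&gt;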
&lt;p&gt;Verify:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;show vrrp
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Multiple VRRP Groups&lt;/h2&gt;
&lt;p&gt;Most routers have multiple interfaces. Each needs its own VRRP group:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# LAN side
set high-availability vrrp group LAN vrid &apos;10&apos;
set high-availability vrrp group LAN interface &apos;eth0&apos;
set high-availability vrrp group LAN virtual-address &apos;10.0.0.1/24&apos;
set high-availability vrrp group LAN priority &apos;200&apos;

# WAN side
set high-availability vrrp group WAN vrid &apos;20&apos;
set high-availability vrrp group WAN interface &apos;eth1&apos;
set high-availability vrrp group WAN virtual-address &apos;203.0.113.1/24&apos;
set high-availability vrrp group WAN priority &apos;200&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Sync Groups: Fail Together&lt;/h2&gt;
&lt;p&gt;If LAN interface fails but WAN is fine, you want BOTH to fail over. Otherwise, traffic enters one router and tries to exit another — asymmetric routing disaster.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

set high-availability vrrp sync-group MAIN member &apos;LAN&apos;
set high-availability vrrp sync-group MAIN member &apos;WAN&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now if either interface fails, both VRRP groups fail over together.&lt;/p&gt;
&lt;h2&gt;What VRRP Does NOT Do&lt;/h2&gt;
&lt;p&gt;VRRP fails over IP addresses. That&apos;s it. It does NOT automatically:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Transfer active TCP connections&lt;/li&gt;
&lt;li&gt;Sync NAT translation tables&lt;/li&gt;
&lt;li&gt;Maintain firewall connection state&lt;/li&gt;
&lt;li&gt;Preserve BGP sessions&lt;/li&gt;
&lt;li&gt;Sync DHCP leases&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For that, you need additional sync mechanisms.&lt;/p&gt;
&lt;h2&gt;Connection Tracking Sync (Conntrack)&lt;/h2&gt;
&lt;p&gt;VyOS can synchronize its connection tracking table between routers. This means established connections (TCP sessions, NAT translations) survive failover.&lt;/p&gt;
&lt;h3&gt;Conntrack Sync Configuration&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;On both routers:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Define sync interface (dedicated link between routers)
set service conntrack-sync interface &apos;eth2&apos;
set service conntrack-sync failover-mechanism vrrp sync-group &apos;MAIN&apos;
set service conntrack-sync accept-protocol &apos;tcp,udp,icmp&apos;

# Optional: Exclude local traffic
set service conntrack-sync ignore-address ipv4-address &apos;10.0.0.2&apos;
set service conntrack-sync ignore-address ipv4-address &apos;10.0.0.3&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Requirements:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Dedicated interface between routers (eth2 in this example)&lt;/li&gt;
&lt;li&gt;This interface should be a direct (crossover) link or on an isolated VLAN&lt;/li&gt;
&lt;li&gt;NOT through the same switch that might fail&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;What Conntrack Sync Does&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Syncs NAT translation table (internal→external mappings)&lt;/li&gt;
&lt;li&gt;Syncs connection states (ESTABLISHED, RELATED)&lt;/li&gt;
&lt;li&gt;Allows TCP connections to survive failover (mostly)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;What Conntrack Sync Does NOT Do&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Guarantee zero packet loss during failover&lt;/li&gt;
&lt;li&gt;Sync application-layer state&lt;/li&gt;
&lt;li&gt;Help with stateless protocols beyond basic tracking&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;The &quot;Mostly&quot; Caveat&lt;/h3&gt;
&lt;p&gt;TCP connections &lt;em&gt;can&lt;/em&gt; survive, but:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Packets in flight are lost.&lt;/strong&gt; During failover (typically 1-3 seconds), packets are dropped. TCP will retransmit, but there&apos;s a gap.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Sequence number issues.&lt;/strong&gt; Sometimes the new master&apos;s kernel disagrees about TCP sequence numbers. Connection may reset.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Asymmetric routing.&lt;/strong&gt; If return traffic goes to wrong router, connections break. Sync groups help, but network design matters.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;Realistic expectation:&lt;/strong&gt; Long-lived connections (SSH sessions, database connections) usually survive. Short requests during failover may fail and retry. Users experience a brief hiccup, not a disconnect.&lt;/p&gt;
&lt;h2&gt;What You CAN&apos;T Sync&lt;/h2&gt;
&lt;h3&gt;BGP Sessions&lt;/h3&gt;
&lt;p&gt;BGP sessions are between your router and peer. When you fail over:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;New master has different source IP (its physical IP)&lt;/li&gt;
&lt;li&gt;Peer sees different neighbor&lt;/li&gt;
&lt;li&gt;BGP session must re-establish&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This takes seconds to minutes depending on timers.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Mitigation:&lt;/strong&gt; Use aggressive BGP timers, BFD, and accept that BGP convergence is part of failover time.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Aggressive keepalive (3s) and hold (9s)
set protocols bgp neighbor 198.51.100.1 timers keepalive &apos;3&apos;
set protocols bgp neighbor 198.51.100.1 timers holdtime &apos;9&apos;

# BFD for faster detection
set protocols bgp neighbor 198.51.100.1 bfd

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;IPsec Tunnels&lt;/h3&gt;
&lt;p&gt;IPsec SAs are bound to specific IPs. On failover:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;IKE SAs re-negotiate&lt;/li&gt;
&lt;li&gt;Child SAs re-establish&lt;/li&gt;
&lt;li&gt;Tunnel is down for seconds&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Mitigation:&lt;/strong&gt; Use DPD (Dead Peer Detection) with short intervals. Accept brief tunnel downtime.&lt;/p&gt;
&lt;h3&gt;Routing Protocol State&lt;/h3&gt;
&lt;p&gt;OSPF neighbor relationships, BGP tables — none of this syncs. The new master starts fresh:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;OSPF: Neighbors detect failure via dead interval, then re-elect&lt;/li&gt;
&lt;li&gt;BGP: Sessions reset, routes re-exchanged&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Application Sessions&lt;/h3&gt;
&lt;p&gt;If you&apos;re running services on VyOS (rare, but possible):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;DHCP leases: Can sync with ISC DHCP failover, but VyOS config is separate&lt;/li&gt;
&lt;li&gt;DNS cache: Lost&lt;/li&gt;
&lt;li&gt;Any local state: Lost&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Testing Failover&lt;/h2&gt;
&lt;p&gt;HA that isn&apos;t tested is HA that doesn&apos;t work. Test before production.&lt;/p&gt;
&lt;h3&gt;Test 1: Clean Failover&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# On primary, simulate failure
sudo ip link set eth0 down

# Watch secondary take over
show vrrp

# Verify traffic flows
# From a client, ping the VIP continuously
ping 10.0.0.1

# Restore
sudo ip link set eth0 up
&lt;/code&gt;&lt;/pre&gt;
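&lt;p&gt;To turn the continuous ping into a number, log each probe and measure the longest run of losses. A minimal sketch, assuming a one-second probe interval and a &lt;code&gt;(timestamp, succeeded)&lt;/code&gt; log format of my own invention:&lt;/p&gt;

```python
# estimate failover downtime from a probe log (format is my invention):
# a list of (timestamp, succeeded) pairs taken at one-second intervals
def longest_gap(probes):
    gap = best = 0
    for _, ok in probes:
        gap = 0 if ok else gap + 1
        best = max(best, gap)
    return best  # longest run of lost probes = seconds of downtime

# simulated log: probes at t=10, 11, 12 were lost during failover
log = [(t, t not in (10, 11, 12)) for t in range(20)]
print(longest_gap(log))  # 3
```

&lt;p&gt;Record this number for every drill; a failover that used to take 2 seconds and now takes 10 is a regression worth investigating.&lt;/p&gt;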
&lt;h3&gt;Test 2: Primary Recovery (Preemption)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Ensure preempt is enabled
# Take down primary, let secondary take over
# Bring primary back up
# Verify primary reclaims master
show vrrp
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Test 3: Connection Survival&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Start long-running connection through router
# SSH through the VIP to a server on the other side
ssh user@server-behind-router

# Fail over primary
# Check if SSH session survives
# It should pause briefly then continue
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Test 4: Split Brain&lt;/h3&gt;
&lt;p&gt;What if the sync link fails but both routers are up?&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Disconnect eth2 (sync interface)
# Both routers think they&apos;re alone
# Both might become master = split brain

# VyOS should still function, but conntrack sync stops
# This is a failure mode to document, not prevent
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Test 5: Dual Failure&lt;/h3&gt;
&lt;p&gt;What if both routers fail?&lt;/p&gt;
&lt;p&gt;This isn&apos;t HA&apos;s job. HA handles single failures. Document that a dual failure means an outage, and set expectations accordingly.&lt;/p&gt;
&lt;h2&gt;VRRP Tuning&lt;/h2&gt;
&lt;h3&gt;Advertisement Interval&lt;/h3&gt;
&lt;p&gt;Default is 1 second. Faster detection = faster failover, but more network traffic.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

set high-availability vrrp group LAN advertise-interval &apos;1&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For sub-second failover, some use 0.1-0.5 second intervals (this requires VRRPv3, which times advertisements in centiseconds). Be careful — this is more sensitive to network jitter.&lt;/p&gt;
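&lt;p&gt;The worst-case detection time follows directly from the protocol timers. For VRRPv2 (RFC 3768), a backup declares the master down after three missed advertisements plus a priority-derived skew:&lt;/p&gt;

```python
# worst-case VRRP failure detection (VRRPv2, RFC 3768):
# master_down_interval = 3 * advertisement_interval + skew_time,
# where skew_time = (256 - priority) / 256 seconds
def master_down_interval(adv_interval_s, priority):
    skew = (256 - priority) / 256
    return 3 * adv_interval_s + skew

print(round(master_down_interval(1.0, 100), 3))   # 3.609  (default 1 s interval)
print(round(master_down_interval(0.25, 100), 3))  # 1.359  (aggressive tuning)
```

&lt;p&gt;Lowering the interval cuts detection time roughly in proportion, which is why sub-second intervals are tempting despite the jitter sensitivity.&lt;/p&gt;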
&lt;h3&gt;Preempt Delay&lt;/h3&gt;
&lt;p&gt;When primary recovers, don&apos;t immediately preempt. Let it stabilize:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

set high-availability vrrp group LAN preempt-delay &apos;30&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Primary must be up for 30 seconds before reclaiming master. Prevents flapping.&lt;/p&gt;
&lt;h3&gt;Health Check Scripts&lt;/h3&gt;
&lt;p&gt;Fail over based on more than interface status:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

set high-availability vrrp group LAN health-check script &apos;/config/scripts/check-uplink.sh&apos;
set high-availability vrrp group LAN health-check interval &apos;5&apos;
set high-availability vrrp group LAN health-check failure-count &apos;3&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Example script (&lt;code&gt;/config/scripts/check-uplink.sh&lt;/code&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# Check if upstream is reachable
ping -c 1 -W 1 198.51.100.1 &amp;gt; /dev/null 2&amp;gt;&amp;amp;1
exit $?
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If script returns non-zero 3 times, VRRP fails over.&lt;/p&gt;
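&lt;p&gt;The debounce behaviour is worth internalizing: a single failed probe does nothing; only a streak triggers failover. A toy model of the failure-count logic (my sketch, not keepalived&apos;s code):&lt;/p&gt;

```python
# toy model of the health-check debounce (illustration only):
# fail over only after failure_count consecutive non-zero script results
def should_failover(results, failure_count=3):
    streak = 0
    for rc in results:
        streak = streak + 1 if rc != 0 else 0
        if streak >= failure_count:
            return True
    return False

print(should_failover([0, 1, 1, 0, 1]))  # False: never 3 failures in a row
print(should_failover([0, 1, 1, 1]))     # True: 3 consecutive failures
```

&lt;p&gt;Tune &lt;code&gt;interval&lt;/code&gt; and &lt;code&gt;failure-count&lt;/code&gt; together: 5-second probes with a count of 3 means roughly 15 seconds before an uplink failure triggers failover.&lt;/p&gt;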
&lt;h2&gt;Realistic HA Architecture&lt;/h2&gt;
&lt;p&gt;A production VyOS HA setup:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;                    ┌─────────────────┐
                    │    Internet     │
                    └────────┬────────┘
                             │
              ┌──────────────┴──────────────┐
              │                             │
        ┌─────┴─────┐               ┌───────┴───────┐
        │  R1 (Pri) │───sync link───│  R2 (Backup)  │
        │ eth1: WAN │     eth2      │  eth1: WAN    │
        │ eth0: LAN │               │  eth0: LAN    │
        └─────┬─────┘               └───────┬───────┘
              │ VIP: 10.0.0.1               │
              └──────────────┬──────────────┘
                             │
                    ┌────────┴────────┐
                    │   LAN Switch    │
                    └────────┬────────┘
                             │
                    ┌────────┴────────┐
                    │     Clients     │
                    └─────────────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Key points:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Dedicated sync link (eth2) — not through the LAN switch&lt;/li&gt;
&lt;li&gt;Both routers connect to the same LAN switch (a single point of failure, but usually acceptable)&lt;/li&gt;
&lt;li&gt;VIP is what clients use&lt;/li&gt;
&lt;li&gt;If the LAN switch fails, both routers are useless anyway&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;HA is not a checkbox. It&apos;s a set of failure scenarios and tests.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;VRRP gives you IP failover in seconds. Conntrack sync gives you connection state (mostly). But:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;BGP sessions reset&lt;/li&gt;
&lt;li&gt;IPsec tunnels re-establish&lt;/li&gt;
&lt;li&gt;Application state is lost&lt;/li&gt;
&lt;li&gt;Brief packet loss happens&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;HA means: single router failure doesn&apos;t cause outage. It doesn&apos;t mean zero impact. Users may see a brief hiccup. Long connections survive but might stutter. This is acceptable for most use cases.&lt;/p&gt;
&lt;p&gt;What makes HA work isn&apos;t the configuration — it&apos;s the testing. Every failure scenario you test is one you understand. Every one you skip is one that surprises you at 3 AM.&lt;/p&gt;
&lt;p&gt;Document your failure modes. Test your failover. Know exactly what happens when the primary dies. That&apos;s HA.&lt;/p&gt;
</content:encoded><category>vyos</category><category>firewall</category><category>ha</category><category>networking</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>VRF &amp; Segmentation: When VLANs Aren&apos;t Enough</title><link>https://ashimov.com/posts/vyos-vrf/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-vrf/</guid><description>Using VRF on VyOS for network isolation that goes beyond VLANs. Covers VRF creation, inter-VRF routing, route leaking, firewalling between VRFs, and maintaining a clear mental model of your segmentation.</description><pubDate>Tue, 11 Nov 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;VLANs give you Layer 2 separation. Different broadcast domains, different subnets. But they all share the same routing table. When a server in VLAN 10 wants to reach a server in VLAN 20, your router sees both networks, compares the destination to its single routing table, and forwards.&lt;/p&gt;
&lt;p&gt;VRF (Virtual Routing and Forwarding) gives you something VLANs can&apos;t: completely separate routing tables. Traffic in VRF &quot;Production&quot; has no idea that VRF &quot;Management&quot; exists. They&apos;re parallel universes on the same hardware.&lt;/p&gt;
&lt;p&gt;When do you need this? When VLAN isolation isn&apos;t enough. Multi-tenant environments, management plane separation, compliance requirements, or just wanting to ensure that a routing mistake in one segment can&apos;t affect another.&lt;/p&gt;
&lt;h2&gt;VRF Fundamentals&lt;/h2&gt;
&lt;p&gt;A VRF is an isolated routing instance. It has:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Its own routing table&lt;/li&gt;
&lt;li&gt;Its own interfaces&lt;/li&gt;
&lt;li&gt;Its own routing protocols (BGP, OSPF, static routes)&lt;/li&gt;
&lt;li&gt;No visibility into other VRFs unless you explicitly leak routes&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Think of it as running multiple virtual routers on one box.&lt;/p&gt;
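&lt;p&gt;A toy model makes the isolation concrete: each VRF is its own lookup table, consulted independently, so a destination that resolves in one table simply doesn&apos;t exist in another:&lt;/p&gt;

```python
# toy model of VRF isolation (illustration only): one routing table per
# VRF, each looked up independently
import ipaddress

TABLES = {
    "PRODUCTION": {"10.1.0.0/24": "eth1"},
    "MANAGEMENT": {"10.100.0.0/24": "eth0.100"},
}

def lookup(vrf, dst):
    for prefix, iface in TABLES[vrf].items():
        if ipaddress.ip_address(dst) in ipaddress.ip_network(prefix):
            return iface
    return None  # no route in this VRF's table

print(lookup("PRODUCTION", "10.1.0.5"))  # eth1
print(lookup("MANAGEMENT", "10.1.0.5"))  # None: the route exists only in PRODUCTION
```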
&lt;h2&gt;Creating VRFs on VyOS&lt;/h2&gt;
&lt;p&gt;Basic VRF setup:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Create VRFs
set vrf name PRODUCTION description &apos;Production workloads&apos;
set vrf name PRODUCTION table &apos;100&apos;

set vrf name MANAGEMENT description &apos;Management and monitoring&apos;
set vrf name MANAGEMENT table &apos;200&apos;

set vrf name DMZ description &apos;Public-facing services&apos;
set vrf name DMZ table &apos;300&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;table&lt;/code&gt; parameter is the routing table ID. Each VRF needs a unique one.&lt;/p&gt;
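&lt;p&gt;It&apos;s worth validating table IDs before they reach a router. A small sanity check (my own sketch): IDs must be unique, and Linux reserves tables 253-255 (default, main, local):&lt;/p&gt;

```python
# sanity check for a VRF plan (illustration only): table IDs must be
# unique and avoid the tables Linux reserves (253 default, 254 main, 255 local)
RESERVED_TABLES = {253, 254, 255}

def validate_vrfs(vrfs):
    # vrfs: mapping of VRF name to routing table ID
    ids = list(vrfs.values())
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate table ID")
    clash = RESERVED_TABLES.intersection(ids)
    if clash:
        raise ValueError(f"reserved table ID used: {sorted(clash)}")

validate_vrfs({"PRODUCTION": 100, "MANAGEMENT": 200, "DMZ": 300})  # passes
```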
&lt;h2&gt;Assigning Interfaces to VRFs&lt;/h2&gt;
&lt;p&gt;Interfaces must be assigned to a VRF. An interface can only belong to one VRF at a time.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Physical interface in Production
set interfaces ethernet eth1 vrf &apos;PRODUCTION&apos;
set interfaces ethernet eth1 address &apos;10.1.0.1/24&apos;
set interfaces ethernet eth1 description &apos;Production LAN&apos;

# VLAN interfaces in different VRFs
set interfaces ethernet eth0 vif 100 vrf &apos;MANAGEMENT&apos;
set interfaces ethernet eth0 vif 100 address &apos;10.100.0.1/24&apos;

set interfaces ethernet eth0 vif 200 vrf &apos;DMZ&apos;
set interfaces ethernet eth0 vif 200 address &apos;10.200.0.1/24&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: Once an interface is in a VRF, its routes only exist in that VRF&apos;s table. The main routing table won&apos;t see them.&lt;/p&gt;
&lt;h2&gt;Viewing VRF Status&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;# List all VRFs
show vrf

# Show routes in a specific VRF
show ip route vrf PRODUCTION

# Show interfaces in a VRF
show vrf PRODUCTION

# Ping from a specific VRF
ping 10.1.0.5 vrf PRODUCTION
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Static Routes in VRFs&lt;/h2&gt;
&lt;p&gt;Static routes can be VRF-specific:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Default route in Production VRF
set protocols static route 0.0.0.0/0 next-hop 10.1.0.254 vrf &apos;PRODUCTION&apos;

# Specific route in Management VRF
set protocols static route 10.0.0.0/8 next-hop 10.100.0.254 vrf &apos;MANAGEMENT&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Running Routing Protocols in VRFs&lt;/h2&gt;
&lt;p&gt;Each VRF can run its own routing protocol instances:&lt;/p&gt;
&lt;h3&gt;OSPF per VRF&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# OSPF in Production VRF
set protocols ospf vrf PRODUCTION area 0 network &apos;10.1.0.0/24&apos;
set protocols ospf vrf PRODUCTION parameters router-id &apos;10.1.0.1&apos;
set protocols ospf vrf PRODUCTION redistribute connected

# OSPF in Management VRF (completely separate)
set protocols ospf vrf MANAGEMENT area 0 network &apos;10.100.0.0/24&apos;
set protocols ospf vrf MANAGEMENT parameters router-id &apos;10.100.0.1&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;BGP per VRF&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# BGP for Production (AS 65001)
set protocols bgp vrf PRODUCTION system-as &apos;65001&apos;
set protocols bgp vrf PRODUCTION neighbor 10.1.0.254 remote-as &apos;65000&apos;

# BGP for DMZ (different AS or same, depending on design)
set protocols bgp vrf DMZ system-as &apos;65002&apos;
set protocols bgp vrf DMZ neighbor 10.200.0.254 remote-as &apos;65000&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Inter-VRF Routing (Route Leaking)&lt;/h2&gt;
&lt;p&gt;Sometimes you need controlled communication between VRFs. A jump host in Management needs to reach Production. A monitoring server needs visibility into all VRFs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Option 1: Static route leaking&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Leak Production network into Management VRF
set protocols static route 10.1.0.0/24 interface &apos;eth1&apos; vrf &apos;MANAGEMENT&apos;
set protocols static route 10.1.0.0/24 next-hop-vrf &apos;PRODUCTION&apos; vrf &apos;MANAGEMENT&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This tells the Management VRF: &quot;to reach 10.1.0.0/24, look in the Production VRF&apos;s routing table.&quot;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Option 2: Import/Export with BGP (MP-BGP)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For complex scenarios, use BGP to import/export routes between VRFs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Define route distinguisher and route targets
set protocols bgp vrf PRODUCTION address-family ipv4-unicast rd &apos;65000:100&apos;
set protocols bgp vrf PRODUCTION address-family ipv4-unicast route-target export &apos;65000:100&apos;
set protocols bgp vrf PRODUCTION address-family ipv4-unicast route-target import &apos;65000:200&apos;

set protocols bgp vrf MANAGEMENT address-family ipv4-unicast rd &apos;65000:200&apos;
set protocols bgp vrf MANAGEMENT address-family ipv4-unicast route-target export &apos;65000:200&apos;
set protocols bgp vrf MANAGEMENT address-family ipv4-unicast route-target import &apos;65000:100&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is the MPLS VPN model applied to VRFs. Routes tagged with RT 65000:100 (from Production) get imported into Management, and vice versa.&lt;/p&gt;
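&lt;p&gt;The matching rule is mechanical: a VRF imports a route when any of the route&apos;s export RTs appears in the VRF&apos;s import list. A toy model:&lt;/p&gt;

```python
# toy model of BGP route-target matching (illustration only)
def imported_routes(routes, import_rts):
    # routes: list of (prefix, set of export route-targets)
    wanted = set(import_rts)
    return [prefix for prefix, rts in routes if rts.intersection(wanted)]

routes = [
    ("10.1.0.0/24", {"65000:100"}),    # exported by PRODUCTION
    ("10.200.0.0/24", {"65000:300"}),  # exported by DMZ
]
print(imported_routes(routes, ["65000:100"]))  # ['10.1.0.0/24']
```

&lt;p&gt;With the configuration above, Management imports RT 65000:100, so it sees Production&apos;s prefixes but not the DMZ&apos;s.&lt;/p&gt;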
&lt;h2&gt;Firewalling Between VRFs&lt;/h2&gt;
&lt;p&gt;Leaking routes doesn&apos;t mean allowing all traffic. Use firewall rules to control inter-VRF communication:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Zone-based approach
set firewall zone PRODUCTION member interface &apos;eth1&apos;
set firewall zone MANAGEMENT member interface &apos;eth0.100&apos;
set firewall zone DMZ member interface &apos;eth0.200&apos;

# Policy: Management can reach Production (SSH only)
set firewall ipv4 name MGMT-TO-PROD default-action &apos;drop&apos;
set firewall ipv4 name MGMT-TO-PROD rule 10 action &apos;accept&apos;
set firewall ipv4 name MGMT-TO-PROD rule 10 destination port &apos;22&apos;
set firewall ipv4 name MGMT-TO-PROD rule 10 protocol &apos;tcp&apos;
set firewall ipv4 name MGMT-TO-PROD rule 10 state &apos;new&apos;

# Policy: Production cannot initiate to Management
set firewall ipv4 name PROD-TO-MGMT default-action &apos;drop&apos;
set firewall ipv4 name PROD-TO-MGMT rule 10 action &apos;accept&apos;
set firewall ipv4 name PROD-TO-MGMT rule 10 state &apos;established&apos;
set firewall ipv4 name PROD-TO-MGMT rule 10 state &apos;related&apos;

# Apply zone policies
set firewall zone PRODUCTION from MANAGEMENT firewall name &apos;MGMT-TO-PROD&apos;
set firewall zone MANAGEMENT from PRODUCTION firewall name &apos;PROD-TO-MGMT&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;VRF for Management Plane Separation&lt;/h2&gt;
&lt;p&gt;A common pattern: keep management traffic completely separate from data traffic.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Management VRF
set vrf name MGMT table &apos;999&apos;

# Management interface
set interfaces ethernet eth0 vrf &apos;MGMT&apos;
set interfaces ethernet eth0 address &apos;192.168.255.1/24&apos;
set interfaces ethernet eth0 description &apos;Out-of-band management&apos;

# SSH binds to Management VRF
set service ssh vrf &apos;MGMT&apos;

# SNMP in Management VRF
set service snmp vrf &apos;MGMT&apos;
# Use a non-default community string (never &apos;public&apos; in production)
set service snmp community monitor authorization &apos;ro&apos;

# NTP in Management VRF
set service ntp vrf &apos;MGMT&apos;
set service ntp server 192.168.255.10

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now SSH, SNMP, and NTP only work through the management interface. Someone on the production network can&apos;t SSH to the router&apos;s production-facing IP — that IP isn&apos;t listening.&lt;/p&gt;
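&lt;p&gt;A quick way to prove that claim to yourself (10.1.0.1 is a stand-in for the router&apos;s production-facing address; adjust for your addressing):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On the router: sshd should be bound inside the MGMT VRF only
ss -tlnp | grep sshd

# From a production-side host: the connection should be refused or
# time out, because nothing listens on the production-facing IP
ssh admin@10.1.0.1
&lt;/code&gt;&lt;/pre&gt;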
&lt;h2&gt;The Mental Model&lt;/h2&gt;
&lt;p&gt;VRF makes complex networks manageable, but only if you maintain a clear mental model. Here&apos;s how to think about it:&lt;/p&gt;
&lt;h3&gt;1. Draw Your VRF Topology&lt;/h3&gt;
&lt;p&gt;Before configuring, diagram:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What VRFs exist&lt;/li&gt;
&lt;li&gt;What interfaces belong to each&lt;/li&gt;
&lt;li&gt;What routes leak between them&lt;/li&gt;
&lt;li&gt;What firewall rules control inter-VRF traffic&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;┌─────────────────────────────────────────────────────┐
│                    VyOS Router                      │
├─────────────┬─────────────────┬────────────────────┤
│    MGMT     │   PRODUCTION    │        DMZ         │
│  Table 999  │    Table 100    │     Table 300      │
├─────────────┼─────────────────┼────────────────────┤
│ eth0        │ eth1            │ eth2               │
│ 192.168.x.x │ 10.1.0.0/16     │ 10.200.0.0/24      │
├─────────────┴─────────────────┴────────────────────┤
│ Route Leaking:                                      │
│ - MGMT → PROD: 10.1.0.0/24 (monitoring)            │
│ - PROD → DMZ: 10.200.0.0/24 (web backends)         │
│ - DMZ → PROD: NONE (DMZ can&apos;t initiate to prod)    │
└─────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Default to Isolation&lt;/h3&gt;
&lt;p&gt;Start with VRFs completely isolated. Only leak routes when there&apos;s a documented requirement. Every route leak is a potential security boundary crossing.&lt;/p&gt;
&lt;h3&gt;3. Document Why&lt;/h3&gt;
&lt;p&gt;For every inter-VRF route, document:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Why it&apos;s needed&lt;/li&gt;
&lt;li&gt;Who approved it&lt;/li&gt;
&lt;li&gt;What firewall rules protect it&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# In VyOS config, use descriptions
set protocols static route 10.1.0.0/24 next-hop-vrf &apos;PRODUCTION&apos; vrf &apos;MANAGEMENT&apos;
# Description: &quot;Monitoring servers in MGMT need to reach Prod for health checks. Approved: 2024-01. Protected by MGMT-TO-PROD firewall rules.&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4. Test Isolation&lt;/h3&gt;
&lt;p&gt;Regularly verify that isolation is working:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Ping the Management address sourced from the DMZ VRF (should fail)
ping 192.168.255.1 vrf DMZ
# Should timeout or be rejected

# Verify routes aren&apos;t leaking unexpectedly
show ip route vrf DMZ
# Should NOT see Management networks
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Real-World Example: Multi-Tenant Router&lt;/h2&gt;
&lt;p&gt;Service provider scenario: one VyOS router handles multiple customers, each in their own VRF.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Customer A
set vrf name CUSTOMER-A table &apos;1001&apos;
set interfaces ethernet eth1 vif 100 vrf &apos;CUSTOMER-A&apos;
set interfaces ethernet eth1 vif 100 address &apos;10.100.1.1/24&apos;
set protocols bgp vrf CUSTOMER-A system-as &apos;65100&apos;
set protocols bgp vrf CUSTOMER-A neighbor 10.100.1.2 remote-as &apos;65100&apos;

# Customer B
set vrf name CUSTOMER-B table &apos;1002&apos;
set interfaces ethernet eth1 vif 200 vrf &apos;CUSTOMER-B&apos;
set interfaces ethernet eth1 vif 200 address &apos;10.100.2.1/24&apos;
set protocols bgp vrf CUSTOMER-B system-as &apos;65200&apos;
set protocols bgp vrf CUSTOMER-B neighbor 10.100.2.2 remote-as &apos;65200&apos;

# Customers cannot see each other — complete isolation
# Each has their own BGP session, their own routes, their own world

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Customer A and B could both use 10.0.0.0/8 internally. No conflict — they&apos;re in different routing tables.&lt;/p&gt;
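&lt;p&gt;Verification is also per-VRF: each customer&apos;s table is inspected on its own (op-mode paths vary slightly by release):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;show ip route vrf CUSTOMER-A
show bgp vrf CUSTOMER-A summary
show ip route vrf CUSTOMER-B
&lt;/code&gt;&lt;/pre&gt;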
&lt;h2&gt;Troubleshooting VRFs&lt;/h2&gt;
&lt;p&gt;Common issues:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. Interface not in VRF&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;show interfaces ethernet eth1
# Check &quot;VRF:&quot; field — should show your VRF name
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;2. Routes not appearing&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;show ip route vrf PRODUCTION
# If empty, check interface addresses and that interface is up
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;3. Traffic not flowing between VRFs&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check routes exist in source VRF
show ip route vrf MANAGEMENT 10.1.0.0/24

# Check firewall isn&apos;t blocking
show firewall statistics

# Check return path — VRF routing must work both directions
show ip route vrf PRODUCTION 10.100.0.0/24
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;4. Services not binding to VRF&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Verify service is configured for VRF
show configuration commands | grep &quot;vrf&quot;

# Check socket bindings
ss -tlnp | grep sshd
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;VRF simplifies complex networks if you keep the model in your head.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;VLANs isolate at Layer 2. VRF isolates at Layer 3. Combined, they give you complete network segmentation. But VRF adds complexity — every interface, every route, every service needs to be VRF-aware.&lt;/p&gt;
&lt;p&gt;The key is maintaining a clear mental model. Know which interfaces are in which VRF. Know what routes leak between them. Know what firewall rules protect those leaks. When you have that model, VRF makes complex multi-tenant or highly-segmented networks manageable. When you lose track of that model, VRF becomes a debugging nightmare.&lt;/p&gt;
&lt;p&gt;Start simple. One or two VRFs. Get comfortable with VRF-aware commands. Expand when you need to. And always, always diagram your VRF topology before you build it.&lt;/p&gt;
</content:encoded><category>vyos</category><category>networking</category><category>routing</category><category>security</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>RPKI/IRR Filtering Strategy: Practical, Not Academic</title><link>https://ashimov.com/posts/vyos-rpki-irr/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-rpki-irr/</guid><description>Real-world BGP route validation using RPKI and IRR on VyOS. Covers validator setup, policy storage, prefix validation workflow, and why filtering is a process, not a single configuration.</description><pubDate>Fri, 07 Nov 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;RPKI and IRR filtering aren&apos;t academic exercises. They&apos;re the difference between a stable network and being part of someone else&apos;s route hijack. Every year we see incidents where networks accept hijacked prefixes because they didn&apos;t validate. The tools exist. The question is whether you use them.&lt;/p&gt;
&lt;p&gt;This isn&apos;t a theoretical overview. This is how to actually implement route validation on VyOS and maintain it over time. Because filtering isn&apos;t something you configure once — it&apos;s an ongoing process.&lt;/p&gt;
&lt;h2&gt;The Problem We&apos;re Solving&lt;/h2&gt;
&lt;p&gt;BGP trusts what peers tell it. If your upstream sends you a route for 8.8.8.0/24, BGP will accept it (unless you filter). If that route is a hijack, you&apos;re now directing your users to an attacker. RPKI and IRR give us ways to validate that routes are legitimate.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;RPKI (Resource Public Key Infrastructure)&lt;/strong&gt;: Cryptographic validation. Route Origin Authorizations (ROAs) are signed by the prefix owner and published in repositories. You query a validator, get the validation state.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;IRR (Internet Routing Registry)&lt;/strong&gt;: Database-driven. Networks register their routing policy (what they originate, what they accept). You build filters from these registrations.&lt;/p&gt;
&lt;p&gt;Neither is perfect. Use both.&lt;/p&gt;
&lt;h2&gt;RPKI Validation on VyOS&lt;/h2&gt;
&lt;h3&gt;Step 1: Run an RPKI Validator&lt;/h3&gt;
&lt;p&gt;VyOS connects to an RPKI validator via the RTR (RPKI-to-Router) protocol. You need a validator running somewhere — this can be on the VyOS router itself (limited resources) or preferably on a separate server.&lt;/p&gt;
&lt;p&gt;Popular validators:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Routinator&lt;/strong&gt; (NLnet Labs) — Rust, lightweight, recommended&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;FORT Validator&lt;/strong&gt; — C, LACNIC&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;rpki-client&lt;/strong&gt; — OpenBSD team, very lightweight&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Example: Running Routinator on a separate Linux server:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On your validator server (Debian/Ubuntu; the package ships in the
# NLnet Labs repository)
apt install routinator

# Configure /etc/routinator/routinator.conf (TOML, top-level keys)
rtr-listen = [&quot;0.0.0.0:3323&quot;]
http-listen = [&quot;0.0.0.0:8323&quot;]

# Start and enable
systemctl enable --now routinator
&lt;/code&gt;&lt;/pre&gt;
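&lt;p&gt;Before pointing routers at it, confirm the validator is healthy. Routinator&apos;s HTTP service exposes a status page and a validity lookup (the AS and prefix below are just examples):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Basic health check
curl -s http://localhost:8323/status

# Ask the validator about a specific announcement
curl -s &quot;http://localhost:8323/validity?asn=AS15169&amp;amp;prefix=8.8.8.0/24&quot;
&lt;/code&gt;&lt;/pre&gt;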
&lt;h3&gt;Step 2: Configure VyOS as RPKI Client&lt;/h3&gt;
&lt;p&gt;Connect VyOS to your validator:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Define RPKI cache server
set protocols rpki cache VALIDATOR address &apos;10.0.0.50&apos;
set protocols rpki cache VALIDATOR port &apos;3323&apos;
set protocols rpki cache VALIDATOR preference &apos;1&apos;

# Optional: Set polling interval (default 300 seconds)
set protocols rpki polling-period &apos;300&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Verify the connection:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;show rpki cache-connection
show rpki prefix-table
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You should see prefixes with their validation states: valid, invalid, or not-found.&lt;/p&gt;
&lt;h3&gt;Step 3: Create Validation Policy&lt;/h3&gt;
&lt;p&gt;Here&apos;s where RPKI becomes useful. Create route-maps that act on validation state:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Route-map for upstream: reject invalid, accept valid and unknown
set policy route-map UPSTREAM-RPKI rule 10 action &apos;deny&apos;
set policy route-map UPSTREAM-RPKI rule 10 match rpki &apos;invalid&apos;

set policy route-map UPSTREAM-RPKI rule 20 action &apos;permit&apos;
set policy route-map UPSTREAM-RPKI rule 20 match rpki &apos;valid&apos;

set policy route-map UPSTREAM-RPKI rule 30 action &apos;permit&apos;
set policy route-map UPSTREAM-RPKI rule 30 match rpki &apos;notfound&apos;
set policy route-map UPSTREAM-RPKI rule 30 set local-preference &apos;90&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Why accept notfound?&lt;/strong&gt; RPKI adoption isn&apos;t 100%. Rejecting unknown would break connectivity to many legitimate destinations. Instead, we lower preference — valid routes are preferred when available.&lt;/p&gt;
&lt;h3&gt;Step 4: Apply to BGP Sessions&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

set protocols bgp neighbor 198.51.100.1 address-family ipv4-unicast route-map import &apos;UPSTREAM-RPKI&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;IRR-Based Filtering&lt;/h2&gt;
&lt;p&gt;RPKI tells you if the origin AS is authorized. IRR tells you what prefixes a network &lt;em&gt;claims&lt;/em&gt; to originate and what their routing policy is. Use IRR to build prefix-lists.&lt;/p&gt;
&lt;h3&gt;Generating Filters from IRR&lt;/h3&gt;
&lt;p&gt;You don&apos;t manually maintain these filters. Tools like &lt;strong&gt;bgpq4&lt;/strong&gt; generate them:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Install bgpq4
apt install bgpq4

# Generate prefix-list for AS-EXAMPLE (an AS-SET)
bgpq4 -4 -l CUSTOMER-PREFIXES AS-EXAMPLE

# VyOS has no native bgpq4 output format, so emit plain prefixes
# (%n = network, %l = mask length) and number the rules with awk
bgpq4 -4 -F &apos;%n/%l\n&apos; AS-EXAMPLE | awk &apos;{n=NR*10;
  printf &quot;set policy prefix-list CUSTOMER-PREFIXES rule %d action permit\nset policy prefix-list CUSTOMER-PREFIXES rule %d prefix %s\n&quot;, n, n, $1}&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Example output:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set policy prefix-list CUSTOMER-PREFIXES rule 10 action permit
set policy prefix-list CUSTOMER-PREFIXES rule 10 prefix 203.0.113.0/24
set policy prefix-list CUSTOMER-PREFIXES rule 20 action permit
set policy prefix-list CUSTOMER-PREFIXES rule 20 prefix 198.51.100.0/24
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Automation Is Required&lt;/h3&gt;
&lt;p&gt;IRR data changes. New prefixes get registered, old ones removed. If you generate filters once and forget, they become stale. This must be automated:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# /opt/scripts/update-irr-filters.sh

CUSTOMER_AS=&quot;AS-EXAMPLE&quot;
OUTPUT=&quot;/tmp/customer-prefixes.vyos&quot;

# Generate VyOS commands
bgpq4 -4 -F &apos;%n/%l\n&apos; $CUSTOMER_AS | awk &apos;{n=NR*10;
  printf &quot;set policy prefix-list CUSTOMER-IN rule %d action permit\nset policy prefix-list CUSTOMER-IN rule %d prefix %s\n&quot;, n, n, $1}&apos; &amp;gt; $OUTPUT

# Add deny-all at the end (prefix and le are separate config nodes)
echo &quot;set policy prefix-list CUSTOMER-IN rule 9999 action deny&quot; &amp;gt;&amp;gt; $OUTPUT
echo &quot;set policy prefix-list CUSTOMER-IN rule 9999 prefix 0.0.0.0/0&quot; &amp;gt;&amp;gt; $OUTPUT
echo &quot;set policy prefix-list CUSTOMER-IN rule 9999 le 32&quot; &amp;gt;&amp;gt; $OUTPUT

# Apply to VyOS (the Vyatta config wrapper runs configure-mode
# commands from scripts)
/opt/vyatta/sbin/vyatta-cfg-cmd-wrapper begin
while read -r line; do
    /opt/vyatta/sbin/vyatta-cfg-cmd-wrapper $line
done &amp;lt; $OUTPUT
/opt/vyatta/sbin/vyatta-cfg-cmd-wrapper commit
/opt/vyatta/sbin/vyatta-cfg-cmd-wrapper end
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run this on a schedule (cron job) — daily or when you receive notification of policy changes.&lt;/p&gt;
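&lt;p&gt;A minimal cron entry for that (path and schedule are just a sketch):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# /etc/cron.d/irr-filters: regenerate filters daily at 04:10
10 4 * * * root /opt/scripts/update-irr-filters.sh &amp;gt;&amp;gt; /var/log/irr-filters.log 2&amp;gt;&amp;amp;1
&lt;/code&gt;&lt;/pre&gt;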
&lt;h2&gt;Where to Store Policy&lt;/h2&gt;
&lt;p&gt;Filtering configuration gets complex. Where should it live?&lt;/p&gt;
&lt;h3&gt;Option 1: On the Router (Simple but Limited)&lt;/h3&gt;
&lt;p&gt;For small deployments, keep everything in VyOS config. Works, but doesn&apos;t scale and no version control.&lt;/p&gt;
&lt;h3&gt;Option 2: Git Repository (Recommended)&lt;/h3&gt;
&lt;p&gt;Store all policies in Git:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;network-policy/
├── prefix-lists/
│   ├── customers/
│   │   ├── customer-a.txt
│   │   └── customer-b.txt
│   └── bogons.txt
├── as-sets/
│   └── customer-as-sets.txt
├── route-maps/
│   └── templates/
└── scripts/
    ├── generate-filters.sh
    └── deploy.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Benefits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Version history (who changed what, when)&lt;/li&gt;
&lt;li&gt;Review process (PRs before deployment)&lt;/li&gt;
&lt;li&gt;Rollback capability&lt;/li&gt;
&lt;li&gt;Documentation alongside config&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Option 3: Automation Platform (Ansible/Nornir)&lt;/h3&gt;
&lt;p&gt;For larger networks, use configuration management:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Ansible playbook
- name: Update BGP filters
  hosts: routers
  tasks:
    - name: Generate prefix-lists from IRR
      ansible.builtin.command: &quot;bgpq4 -4 -l {{ customer.name }}-IN {{ customer.as_set }}&quot;
      delegate_to: localhost
      register: prefix_list

    - name: Apply prefix-lists
      vyos.vyos.vyos_config:
        lines: &quot;{{ prefix_list.stdout_lines }}&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Validation Workflow&lt;/h2&gt;
&lt;p&gt;Here&apos;s a practical workflow that actually gets maintained:&lt;/p&gt;
&lt;h3&gt;1. Onboarding New Peers/Customers&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Customer request → Verify AS/prefix ownership →
Add to IRR monitoring → Generate initial filters →
Apply with soft-reconfiguration → Monitor logs
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Ongoing Maintenance&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Weekly:
  - Regenerate IRR-based filters (automated)
  - Review RPKI invalid count (should be zero or near-zero)
  - Check validator health

Monthly:
  - Review filter hit counts (unused rules?)
  - Verify customer IRR registrations current
  - Test failover to backup validators
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Incident Response&lt;/h3&gt;
&lt;p&gt;When you see RPKI invalid or unexpected routes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check specific prefix
show rpki prefix 203.0.113.0/24

# See what the validator says
show rpki cache-server

# Check where route is coming from
show ip bgp 203.0.113.0/24
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Combined Policy Example&lt;/h2&gt;
&lt;p&gt;Real-world inbound filter combining RPKI and prefix-list:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Bogon prefix-list (never accept these)
set policy prefix-list BOGONS rule 10 action &apos;permit&apos;
set policy prefix-list BOGONS rule 10 prefix &apos;0.0.0.0/8&apos;
set policy prefix-list BOGONS rule 10 le &apos;32&apos;

set policy prefix-list BOGONS rule 20 action &apos;permit&apos;
set policy prefix-list BOGONS rule 20 prefix &apos;10.0.0.0/8&apos;
set policy prefix-list BOGONS rule 20 le &apos;32&apos;

set policy prefix-list BOGONS rule 30 action &apos;permit&apos;
set policy prefix-list BOGONS rule 30 prefix &apos;127.0.0.0/8&apos;
set policy prefix-list BOGONS rule 30 le &apos;32&apos;

# ... add remaining bogons

# Combined route-map
set policy route-map UPSTREAM-IN rule 10 action &apos;deny&apos;
set policy route-map UPSTREAM-IN rule 10 match ip address prefix-list &apos;BOGONS&apos;
set policy route-map UPSTREAM-IN rule 10 description &apos;Reject bogons&apos;

set policy route-map UPSTREAM-IN rule 20 action &apos;deny&apos;
set policy route-map UPSTREAM-IN rule 20 match rpki &apos;invalid&apos;
set policy route-map UPSTREAM-IN rule 20 description &apos;Reject RPKI invalid&apos;

set policy route-map UPSTREAM-IN rule 30 action &apos;permit&apos;
set policy route-map UPSTREAM-IN rule 30 match rpki &apos;valid&apos;
set policy route-map UPSTREAM-IN rule 30 set local-preference &apos;110&apos;
set policy route-map UPSTREAM-IN rule 30 description &apos;Prefer RPKI valid&apos;

set policy route-map UPSTREAM-IN rule 100 action &apos;permit&apos;
set policy route-map UPSTREAM-IN rule 100 description &apos;Accept remaining&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
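&lt;p&gt;Don&apos;t forget to attach the route-map and re-evaluate routes that were already learned (the soft-clear command name varies slightly by VyOS release; FRR&apos;s vtysh equivalent is &lt;code&gt;clear ip bgp ... soft in&lt;/code&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set protocols bgp neighbor 198.51.100.1 address-family ipv4-unicast route-map import &apos;UPSTREAM-IN&apos;
commit

# From op mode: refresh inbound routes without bouncing the session
reset ip bgp 198.51.100.1 soft in
&lt;/code&gt;&lt;/pre&gt;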
&lt;h2&gt;Monitoring and Alerting&lt;/h2&gt;
&lt;p&gt;Filtering without monitoring is incomplete. Track:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Routes accepted despite being RPKI-invalid (should be none if the
# import policy rejects them)
show bgp ipv4 unicast rpki invalid

# Prefix counts per neighbor, plus validator connection status
show ip bgp summary
show rpki cache-server
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Set up alerts when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;RPKI validator connection drops&lt;/li&gt;
&lt;li&gt;Number of invalid routes increases suddenly&lt;/li&gt;
&lt;li&gt;Prefix count changes dramatically&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Filtering is a process, not a config.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;You can&apos;t configure BGP filters once and forget them. Policies change, customers add prefixes, hijacks happen. Effective filtering requires:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Automated generation (IRR → filters)&lt;/li&gt;
&lt;li&gt;Cryptographic validation (RPKI)&lt;/li&gt;
&lt;li&gt;Version control (Git)&lt;/li&gt;
&lt;li&gt;Regular review (is this still correct?)&lt;/li&gt;
&lt;li&gt;Monitoring (are filters working?)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The networks that avoid hijack incidents aren&apos;t lucky — they have processes that keep their filters current. The ones that become headlines treated filtering as a one-time task.&lt;/p&gt;
&lt;p&gt;Start with RPKI (it&apos;s the lowest effort for significant protection), add IRR-based filters for customers and peers, automate the maintenance, and review regularly. That&apos;s the practical path to route validation that actually works.&lt;/p&gt;
</content:encoded><category>vyos</category><category>bgp</category><category>routing</category><category>security</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>BGP on VyOS: Filters Are Not Optional</title><link>https://ashimov.com/posts/vyos-bgp/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-bgp/</guid><description>BGP fundamentals on VyOS using FRR. Covers eBGP/iBGP setup, prefix-lists, route-maps, communities, max-prefix protection, and why BGP without filtering is an incident waiting to happen.</description><pubDate>Tue, 04 Nov 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;BGP is the protocol that runs the internet. It&apos;s also the protocol that can take down the internet — one misconfiguration, one missing filter, and you&apos;re either leaking routes or accepting someone else&apos;s garbage. Every major BGP incident (and there are many) comes down to the same thing: missing or inadequate filtering.&lt;/p&gt;
&lt;p&gt;VyOS uses FRR (Free Range Routing) as its BGP implementation. It&apos;s battle-tested and feature-complete. But FRR, like any BGP implementation, will happily accept and advertise whatever you tell it to. It&apos;s your job to tell it the right things.&lt;/p&gt;
&lt;h2&gt;BGP Fundamentals&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;eBGP (External BGP)&lt;/strong&gt;: Between different Autonomous Systems (AS). Your connection to ISPs, IXPs, or other organizations.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;iBGP (Internal BGP)&lt;/strong&gt;: Within the same AS. Distributes external routes internally.&lt;/p&gt;
&lt;p&gt;Key difference: an eBGP speaker prepends its own AS to the AS path when advertising a route. iBGP doesn&apos;t — routes learned via iBGP keep their original AS path.&lt;/p&gt;
&lt;h2&gt;Basic eBGP Configuration&lt;/h2&gt;
&lt;p&gt;Typical scenario: You have AS 65000, connecting to an upstream provider (AS 12345).&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Define your AS and router ID
set protocols bgp system-as &apos;65000&apos;
set protocols bgp parameters router-id &apos;203.0.113.1&apos;

# Define neighbor (your upstream)
set protocols bgp neighbor 198.51.100.1 remote-as &apos;12345&apos;
set protocols bgp neighbor 198.51.100.1 description &apos;Upstream Provider&apos;
set protocols bgp neighbor 198.51.100.1 update-source &apos;203.0.113.1&apos;

# Announce your prefix
set protocols bgp address-family ipv4-unicast network 203.0.113.0/24

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Never run this in production.&lt;/strong&gt; This configuration has no filters — you&apos;ll accept whatever your upstream sends (including their full table if they misconfigure) and potentially leak routes.&lt;/p&gt;
&lt;h2&gt;The Golden Rule: Always Filter&lt;/h2&gt;
&lt;p&gt;BGP without filters is an incident waiting to happen. Every BGP session needs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Inbound filter&lt;/strong&gt;: What routes you accept&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Outbound filter&lt;/strong&gt;: What routes you advertise&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;max-prefix limit&lt;/strong&gt;: Protection against route leaks from peers&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;No exceptions.&lt;/p&gt;
&lt;h2&gt;Prefix Lists&lt;/h2&gt;
&lt;p&gt;Prefix lists define which IP prefixes to match:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# What you own and want to advertise
set policy prefix-list MY-PREFIXES rule 10 action &apos;permit&apos;
set policy prefix-list MY-PREFIXES rule 10 prefix &apos;203.0.113.0/24&apos;

# What you accept from upstream (example: allow all but filter specifics)
set policy prefix-list UPSTREAM-IN rule 10 action &apos;deny&apos;
set policy prefix-list UPSTREAM-IN rule 10 prefix &apos;0.0.0.0/0&apos;
set policy prefix-list UPSTREAM-IN rule 10 description &apos;Deny default unless explicitly wanted&apos;

set policy prefix-list UPSTREAM-IN rule 20 action &apos;deny&apos;
set policy prefix-list UPSTREAM-IN rule 20 prefix &apos;10.0.0.0/8&apos;
set policy prefix-list UPSTREAM-IN rule 20 le &apos;32&apos;
set policy prefix-list UPSTREAM-IN rule 20 description &apos;Deny RFC1918&apos;

set policy prefix-list UPSTREAM-IN rule 30 action &apos;deny&apos;
set policy prefix-list UPSTREAM-IN rule 30 prefix &apos;172.16.0.0/12&apos;
set policy prefix-list UPSTREAM-IN rule 30 le &apos;32&apos;

set policy prefix-list UPSTREAM-IN rule 40 action &apos;deny&apos;
set policy prefix-list UPSTREAM-IN rule 40 prefix &apos;192.168.0.0/16&apos;
set policy prefix-list UPSTREAM-IN rule 40 le &apos;32&apos;

set policy prefix-list UPSTREAM-IN rule 1000 action &apos;permit&apos;
set policy prefix-list UPSTREAM-IN rule 1000 prefix &apos;0.0.0.0/0&apos;
set policy prefix-list UPSTREAM-IN rule 1000 le &apos;24&apos;
set policy prefix-list UPSTREAM-IN rule 1000 description &apos;Accept /24 and larger&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Prefix List Syntax&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;prefix&lt;/strong&gt;: The IP prefix to match&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;le&lt;/strong&gt; (less than or equal): Match prefixes with length up to this value&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ge&lt;/strong&gt; (greater than or equal): Match prefixes with length at least this value&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Example: &lt;code&gt;prefix 10.0.0.0/8 le 24 ge 16&lt;/code&gt; matches any prefix within 10.0.0.0/8 that is /16 to /24.&lt;/p&gt;
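&lt;p&gt;Two more worked cases to cement the semantics:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# No le/ge: matches exactly 10.0.0.0/8 and nothing else
prefix 10.0.0.0/8

# ge 9: any more-specific of 10.0.0.0/8 (that is, /9 through /32),
# excluding the /8 itself
prefix 10.0.0.0/8 ge 9
&lt;/code&gt;&lt;/pre&gt;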
&lt;h2&gt;Route Maps&lt;/h2&gt;
&lt;p&gt;Route maps combine conditions with actions. They&apos;re the Swiss Army knife of BGP policy:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Outbound: Only advertise my prefixes (note: local-preference is
# local to your AS and is not sent to eBGP peers)
set policy route-map TO-UPSTREAM rule 10 action &apos;permit&apos;
set policy route-map TO-UPSTREAM rule 10 match ip address prefix-list &apos;MY-PREFIXES&apos;

set policy route-map TO-UPSTREAM rule 1000 action &apos;deny&apos;
set policy route-map TO-UPSTREAM rule 1000 description &apos;Deny everything else&apos;

# Inbound: Filter bad routes, set local-pref based on source
set policy route-map FROM-UPSTREAM rule 10 action &apos;deny&apos;
set policy route-map FROM-UPSTREAM rule 10 match ip address prefix-list &apos;BOGONS&apos;
set policy route-map FROM-UPSTREAM rule 10 description &apos;Drop bogons&apos;

set policy route-map FROM-UPSTREAM rule 100 action &apos;permit&apos;
set policy route-map FROM-UPSTREAM rule 100 match ip address prefix-list &apos;UPSTREAM-IN&apos;
set policy route-map FROM-UPSTREAM rule 100 set local-preference &apos;200&apos;

set policy route-map FROM-UPSTREAM rule 1000 action &apos;deny&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Apply Route Maps to Neighbors&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;set protocols bgp neighbor 198.51.100.1 address-family ipv4-unicast route-map import &apos;FROM-UPSTREAM&apos;
set protocols bgp neighbor 198.51.100.1 address-family ipv4-unicast route-map export &apos;TO-UPSTREAM&apos;
&lt;/code&gt;&lt;/pre&gt;
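&lt;p&gt;To audit filters after applying them, enable soft-reconfiguration so VyOS keeps an unfiltered copy of what the peer sent, then compare it with what was accepted:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set protocols bgp neighbor 198.51.100.1 address-family ipv4-unicast soft-reconfiguration inbound

# What the peer sent (pre-policy) vs. what survived the filters
show ip bgp neighbors 198.51.100.1 received-routes
show ip bgp neighbors 198.51.100.1 routes
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Soft-reconfiguration keeps a second copy of received routes in memory, which is significant for full-table peers.&lt;/p&gt;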
&lt;h2&gt;Max-Prefix Protection&lt;/h2&gt;
&lt;p&gt;This is your safety valve. If a peer sends more prefixes than expected, the session tears down:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Expect ~10 prefixes, warn at 8, shut down at 10
set protocols bgp neighbor 198.51.100.1 address-family ipv4-unicast maximum-prefix &apos;10&apos;
set protocols bgp neighbor 198.51.100.1 address-family ipv4-unicast maximum-prefix threshold &apos;80&apos;

# For full table peers (the full IPv4 table is close to one million prefixes)
set protocols bgp neighbor 198.51.100.2 address-family ipv4-unicast maximum-prefix &apos;1000000&apos;
set protocols bgp neighbor 198.51.100.2 address-family ipv4-unicast maximum-prefix threshold &apos;90&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When max-prefix triggers:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Session goes down&lt;/li&gt;
&lt;li&gt;Log entry created&lt;/li&gt;
&lt;li&gt;Manual intervention required (or restart-timer)&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;# Auto-restart after 30 minutes (use carefully)
set protocols bgp neighbor 198.51.100.1 address-family ipv4-unicast maximum-prefix restart &apos;30&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;BGP Communities&lt;/h2&gt;
&lt;p&gt;Communities are tags attached to routes. Use them to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Signal intent to upstreams (traffic engineering)&lt;/li&gt;
&lt;li&gt;Control route propagation&lt;/li&gt;
&lt;li&gt;Implement policy at scale&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Standard Communities&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Define community list
set policy community-list BLACKHOLE rule 10 action &apos;permit&apos;
set policy community-list BLACKHOLE rule 10 regex &apos;65000:666&apos;

# Set community in route-map
set policy route-map SET-COMMUNITY rule 10 action &apos;permit&apos;
set policy route-map SET-COMMUNITY rule 10 set community &apos;65000:100&apos;

# Match community in route-map
set policy route-map MATCH-COMMUNITY rule 10 action &apos;permit&apos;
set policy route-map MATCH-COMMUNITY rule 10 match community &apos;BLACKHOLE&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Well-Known Communities&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Community&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;no-export&lt;/td&gt;
&lt;td&gt;Don&apos;t advertise outside the AS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;no-advertise&lt;/td&gt;
&lt;td&gt;Don&apos;t advertise to any peer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;local-as&lt;/td&gt;
&lt;td&gt;Don&apos;t advertise outside the local confederation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;no-peer&lt;/td&gt;
&lt;td&gt;Don&apos;t advertise to eBGP peers (RFC 3765)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;pre&gt;&lt;code&gt;# Example: Mark routes as no-export
set policy route-map INTERNAL-ONLY rule 10 action &apos;permit&apos;
set policy route-map INTERNAL-ONLY rule 10 set community &apos;no-export&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Blackhole Communities&lt;/h3&gt;
&lt;p&gt;Most transit providers support blackhole communities. When you&apos;re under DDoS attack:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Advertise the attacked IP with blackhole community
# (community varies by provider — check their documentation)
set policy route-map BLACKHOLE-ANNOUNCE rule 10 action &apos;permit&apos;
set policy route-map BLACKHOLE-ANNOUNCE rule 10 match ip address prefix-list &apos;ATTACKED-IP&apos;
set policy route-map BLACKHOLE-ANNOUNCE rule 10 set community &apos;12345:666&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The upstream drops traffic to that prefix at their edge. You stop receiving the DDoS.&lt;/p&gt;
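&lt;p&gt;The prefix-list it matches is just the attacked host as a /32 (the address below is an example):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set policy prefix-list ATTACKED-IP rule 10 action &apos;permit&apos;
set policy prefix-list ATTACKED-IP rule 10 prefix &apos;203.0.113.45/32&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Providers typically accept a blackhole /32 only when it falls inside prefixes you&apos;re authorized to announce.&lt;/p&gt;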
&lt;h2&gt;iBGP Configuration&lt;/h2&gt;
&lt;p&gt;iBGP distributes routes within your AS. Different rules apply:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;No AS path modification&lt;/li&gt;
&lt;li&gt;Full mesh required (or use route reflectors)&lt;/li&gt;
&lt;li&gt;Next-hop often needs adjustment&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Basic iBGP&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

set protocols bgp neighbor 10.0.0.2 remote-as &apos;65000&apos;
set protocols bgp neighbor 10.0.0.2 description &apos;iBGP peer - Router 2&apos;
set protocols bgp neighbor 10.0.0.2 update-source &apos;lo&apos;

# Next-hop-self for eBGP routes advertised via iBGP
set protocols bgp neighbor 10.0.0.2 address-family ipv4-unicast nexthop-self

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Why Next-Hop-Self?&lt;/h3&gt;
&lt;p&gt;When you learn a route via eBGP, the next-hop is the eBGP peer&apos;s IP. If you advertise this to iBGP neighbors, they need a route to that external IP — which they might not have.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;next-hop-self&lt;/code&gt; rewrites the next-hop to your own address, which iBGP peers definitely can reach.&lt;/p&gt;
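&lt;p&gt;The rewrite is easy to picture as a function. A minimal Python sketch of the simplified rule (names and addresses are illustrative, not VyOS internals; real BGP has further next-hop exceptions):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def outgoing_next_hop(route_next_hop, session_type, my_address, next_hop_self=False):
    &quot;&quot;&quot;Next-hop to place in an UPDATE sent on this session (simplified).&quot;&quot;&quot;
    if session_type == &apos;ebgp&apos;:
        return my_address      # eBGP rewrites to the local address by default
    if next_hop_self:
        return my_address      # iBGP with next-hop-self configured
    return route_next_hop      # plain iBGP: the external next-hop leaks through

# Route learned from an eBGP peer at 198.51.100.1, re-advertised over iBGP:
print(outgoing_next_hop(&apos;198.51.100.1&apos;, &apos;ibgp&apos;, &apos;10.0.0.1&apos;))
# 198.51.100.1: iBGP peers need a route to this external IP
print(outgoing_next_hop(&apos;198.51.100.1&apos;, &apos;ibgp&apos;, &apos;10.0.0.1&apos;, next_hop_self=True))
# 10.0.0.1
&lt;/code&gt;&lt;/pre&gt;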
&lt;h3&gt;Route Reflectors&lt;/h3&gt;
&lt;p&gt;Full mesh iBGP doesn&apos;t scale. With N routers, you need N*(N-1)/2 sessions. Route reflectors solve this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On route reflector
set protocols bgp neighbor 10.0.0.2 address-family ipv4-unicast route-reflector-client
set protocols bgp neighbor 10.0.0.3 address-family ipv4-unicast route-reflector-client
set protocols bgp neighbor 10.0.0.4 address-family ipv4-unicast route-reflector-client

# Clients just peer with reflector, not each other
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A route reflector relaxes the normal iBGP split-horizon rule: routes learned from one client are re-advertised to the other clients, so clients never need to peer with each other.&lt;/p&gt;
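&lt;p&gt;The scaling difference is easy to quantify. A Python sketch of the session-count arithmetic:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def full_mesh_sessions(n):
    # Every router peers with every other router: n*(n-1)/2 sessions
    return n * (n - 1) // 2

def rr_sessions(n_clients, n_reflectors=1):
    # Each client peers only with each reflector; reflectors mesh among themselves
    return n_clients * n_reflectors + full_mesh_sessions(n_reflectors)

print(full_mesh_sessions(10))   # 45 sessions
print(rr_sessions(9))           # 9 sessions: same 10 routers, one as reflector
print(full_mesh_sessions(50))   # 1225 sessions
print(rr_sessions(48, 2))       # 97 sessions with two reflectors for redundancy
&lt;/code&gt;&lt;/pre&gt;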
&lt;h2&gt;Local Preference&lt;/h2&gt;
&lt;p&gt;Local preference determines which path to use when you have multiple options. Higher is preferred.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Prefer primary upstream (local-pref 200) over backup (local-pref 100)
set policy route-map FROM-PRIMARY rule 10 set local-preference &apos;200&apos;
set policy route-map FROM-BACKUP rule 10 set local-preference &apos;100&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Local preference is only meaningful within your AS — it&apos;s not advertised to eBGP peers.&lt;/p&gt;
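&lt;p&gt;Local preference sits near the top of the best-path algorithm, ahead of AS-path length. A much-simplified sketch of those first two tie-breakers (the real algorithm has many more steps):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def best_path(paths):
    &quot;&quot;&quot;Pick the preferred path: highest local-pref, then shortest AS path.&quot;&quot;&quot;
    return min(paths, key=lambda p: (-p[&apos;local_pref&apos;], len(p[&apos;as_path&apos;])))

primary = {&apos;local_pref&apos;: 200, &apos;as_path&apos;: [12345, 3356]}
backup  = {&apos;local_pref&apos;: 100, &apos;as_path&apos;: [64500]}

# Local-pref is compared first, so the backup&apos;s shorter AS path never matters:
print(best_path([primary, backup]) is primary)  # True
&lt;/code&gt;&lt;/pre&gt;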
&lt;h2&gt;AS Path Prepending&lt;/h2&gt;
&lt;p&gt;Make your routes less attractive to specific upstreams (traffic engineering):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Prepend your AS twice when advertising to backup upstream
set policy route-map TO-BACKUP rule 10 action &apos;permit&apos;
set policy route-map TO-BACKUP rule 10 match ip address prefix-list &apos;MY-PREFIXES&apos;
set policy route-map TO-BACKUP rule 10 set as-path prepend &apos;65000 65000&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The longer AS path makes this route less preferred. Traffic should prefer the primary path.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Warning&lt;/strong&gt;: Prepending more than 3x is usually pointless. Some networks filter very long AS paths.&lt;/p&gt;
&lt;h2&gt;BGP Timers&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;# Keepalive every 30 seconds, hold time 90 seconds
set protocols bgp neighbor 198.51.100.1 timers keepalive &apos;30&apos;
set protocols bgp neighbor 198.51.100.1 timers holdtime &apos;90&apos;

# For faster failover (aggressive)
set protocols bgp neighbor 198.51.100.1 timers keepalive &apos;10&apos;
set protocols bgp neighbor 198.51.100.1 timers holdtime &apos;30&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For sub-second failover, use BFD instead of aggressive BGP timers:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set protocols bgp neighbor 198.51.100.1 bfd
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Bogon Filtering&lt;/h2&gt;
&lt;p&gt;Always filter bogons — prefixes that should never appear in the global routing table:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# IPv4 Bogons (not exhaustive — use a maintained list)
set policy prefix-list BOGONS-V4 rule 10 action &apos;permit&apos;
set policy prefix-list BOGONS-V4 rule 10 prefix &apos;0.0.0.0/8&apos;
set policy prefix-list BOGONS-V4 rule 10 le &apos;32&apos;

set policy prefix-list BOGONS-V4 rule 20 action &apos;permit&apos;
set policy prefix-list BOGONS-V4 rule 20 prefix &apos;10.0.0.0/8&apos;
set policy prefix-list BOGONS-V4 rule 20 le &apos;32&apos;

set policy prefix-list BOGONS-V4 rule 30 action &apos;permit&apos;
set policy prefix-list BOGONS-V4 rule 30 prefix &apos;100.64.0.0/10&apos;
set policy prefix-list BOGONS-V4 rule 30 le &apos;32&apos;

set policy prefix-list BOGONS-V4 rule 40 action &apos;permit&apos;
set policy prefix-list BOGONS-V4 rule 40 prefix &apos;127.0.0.0/8&apos;
set policy prefix-list BOGONS-V4 rule 40 le &apos;32&apos;

set policy prefix-list BOGONS-V4 rule 50 action &apos;permit&apos;
set policy prefix-list BOGONS-V4 rule 50 prefix &apos;169.254.0.0/16&apos;
set policy prefix-list BOGONS-V4 rule 50 le &apos;32&apos;

set policy prefix-list BOGONS-V4 rule 60 action &apos;permit&apos;
set policy prefix-list BOGONS-V4 rule 60 prefix &apos;172.16.0.0/12&apos;
set policy prefix-list BOGONS-V4 rule 60 le &apos;32&apos;

set policy prefix-list BOGONS-V4 rule 70 action &apos;permit&apos;
set policy prefix-list BOGONS-V4 rule 70 prefix &apos;192.0.0.0/24&apos;
set policy prefix-list BOGONS-V4 rule 70 le &apos;32&apos;

set policy prefix-list BOGONS-V4 rule 80 action &apos;permit&apos;
set policy prefix-list BOGONS-V4 rule 80 prefix &apos;192.0.2.0/24&apos;
set policy prefix-list BOGONS-V4 rule 80 le &apos;32&apos;

set policy prefix-list BOGONS-V4 rule 90 action &apos;permit&apos;
set policy prefix-list BOGONS-V4 rule 90 prefix &apos;192.168.0.0/16&apos;
set policy prefix-list BOGONS-V4 rule 90 le &apos;32&apos;

set policy prefix-list BOGONS-V4 rule 100 action &apos;permit&apos;
set policy prefix-list BOGONS-V4 rule 100 prefix &apos;198.18.0.0/15&apos;
set policy prefix-list BOGONS-V4 rule 100 le &apos;32&apos;

set policy prefix-list BOGONS-V4 rule 110 action &apos;permit&apos;
set policy prefix-list BOGONS-V4 rule 110 prefix &apos;198.51.100.0/24&apos;
set policy prefix-list BOGONS-V4 rule 110 le &apos;32&apos;

set policy prefix-list BOGONS-V4 rule 120 action &apos;permit&apos;
set policy prefix-list BOGONS-V4 rule 120 prefix &apos;203.0.113.0/24&apos;
set policy prefix-list BOGONS-V4 rule 120 le &apos;32&apos;

set policy prefix-list BOGONS-V4 rule 130 action &apos;permit&apos;
set policy prefix-list BOGONS-V4 rule 130 prefix &apos;224.0.0.0/3&apos;
set policy prefix-list BOGONS-V4 rule 130 le &apos;32&apos;

# Apply in route-map
set policy route-map FROM-PEER rule 10 action &apos;deny&apos;
set policy route-map FROM-PEER rule 10 match ip address prefix-list &apos;BOGONS-V4&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For production, use Team Cymru&apos;s bogon reference or automate prefix list updates from IRR.&lt;/p&gt;
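&lt;p&gt;If you track the bogon list out of band, generating the prefix-list mechanically avoids rule-numbering mistakes. A hedged sketch (the output format mirrors the commands above; the list is still not exhaustive):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;BOGONS_V4 = [
    &apos;0.0.0.0/8&apos;, &apos;10.0.0.0/8&apos;, &apos;100.64.0.0/10&apos;, &apos;127.0.0.0/8&apos;,
    &apos;169.254.0.0/16&apos;, &apos;172.16.0.0/12&apos;, &apos;192.0.0.0/24&apos;, &apos;192.0.2.0/24&apos;,
    &apos;192.168.0.0/16&apos;, &apos;198.18.0.0/15&apos;, &apos;198.51.100.0/24&apos;,
    &apos;203.0.113.0/24&apos;, &apos;224.0.0.0/3&apos;,
]

def vyos_prefix_list(name, prefixes, step=10):
    &quot;&quot;&quot;Emit &apos;set policy prefix-list&apos; commands, one rule per prefix.&quot;&quot;&quot;
    lines = []
    for i, prefix in enumerate(prefixes):
        rule = step * (i + 1)
        lines.append(f&quot;set policy prefix-list {name} rule {rule} action &apos;permit&apos;&quot;)
        lines.append(f&quot;set policy prefix-list {name} rule {rule} prefix &apos;{prefix}&apos;&quot;)
        lines.append(f&quot;set policy prefix-list {name} rule {rule} le &apos;32&apos;&quot;)
    return &apos;\n&apos;.join(lines)

print(vyos_prefix_list(&apos;BOGONS-V4&apos;, BOGONS_V4))
&lt;/code&gt;&lt;/pre&gt;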
&lt;h2&gt;Debugging BGP&lt;/h2&gt;
&lt;h3&gt;Check BGP Status&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Summary of all BGP neighbors
show ip bgp summary

# Detailed neighbor info
show ip bgp neighbors 198.51.100.1

# Advertised routes
show ip bgp neighbors 198.51.100.1 advertised-routes

# Received routes
show ip bgp neighbors 198.51.100.1 received-routes

# Routes in BGP table
show ip bgp
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Common Issues&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;State: Idle&lt;/td&gt;
&lt;td&gt;No route to peer, firewall&lt;/td&gt;
&lt;td&gt;Check connectivity, allow TCP 179&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;State: Active&lt;/td&gt;
&lt;td&gt;TCP connect failing&lt;/td&gt;
&lt;td&gt;Firewall, wrong IP, peer not configured&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;State: OpenSent&lt;/td&gt;
&lt;td&gt;Waiting for peer&apos;s OPEN&lt;/td&gt;
&lt;td&gt;Peer might be filtering, AS mismatch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No routes received&lt;/td&gt;
&lt;td&gt;Inbound filter too strict&lt;/td&gt;
&lt;td&gt;Check route-map, prefix-list&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Routes not advertised&lt;/td&gt;
&lt;td&gt;Outbound filter, route not in BGP&lt;/td&gt;
&lt;td&gt;Check network statement, route-map&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;BGP Messages&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Enable BGP debugging (careful in production)
debug bgp neighbor-events
debug bgp updates

# View logs
show log | match bgp
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Complete eBGP Configuration&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;# === BGP Core ===
set protocols bgp system-as &apos;65000&apos;
set protocols bgp parameters router-id &apos;203.0.113.1&apos;
set protocols bgp parameters log-neighbor-changes

# === Prefix Lists ===
set policy prefix-list MY-PREFIXES rule 10 action &apos;permit&apos;
set policy prefix-list MY-PREFIXES rule 10 prefix &apos;203.0.113.0/24&apos;

set policy prefix-list BOGONS-V4 rule 10 action &apos;permit&apos;
set policy prefix-list BOGONS-V4 rule 10 prefix &apos;10.0.0.0/8&apos;
set policy prefix-list BOGONS-V4 rule 10 le &apos;32&apos;
# ... (other bogon entries)

set policy prefix-list INBOUND-FILTER rule 10 action &apos;deny&apos;
set policy prefix-list INBOUND-FILTER rule 10 prefix &apos;0.0.0.0/0&apos;
set policy prefix-list INBOUND-FILTER rule 10 ge &apos;25&apos;
set policy prefix-list INBOUND-FILTER rule 10 description &apos;Deny too-specific prefixes&apos;

set policy prefix-list INBOUND-FILTER rule 1000 action &apos;permit&apos;
set policy prefix-list INBOUND-FILTER rule 1000 prefix &apos;0.0.0.0/0&apos;
set policy prefix-list INBOUND-FILTER rule 1000 le &apos;24&apos;

# === Route Maps ===
set policy route-map TO-UPSTREAM rule 10 action &apos;permit&apos;
set policy route-map TO-UPSTREAM rule 10 match ip address prefix-list &apos;MY-PREFIXES&apos;
set policy route-map TO-UPSTREAM rule 1000 action &apos;deny&apos;

set policy route-map FROM-UPSTREAM rule 10 action &apos;deny&apos;
set policy route-map FROM-UPSTREAM rule 10 match ip address prefix-list &apos;BOGONS-V4&apos;
set policy route-map FROM-UPSTREAM rule 100 action &apos;permit&apos;
set policy route-map FROM-UPSTREAM rule 100 match ip address prefix-list &apos;INBOUND-FILTER&apos;
set policy route-map FROM-UPSTREAM rule 100 set local-preference &apos;200&apos;
set policy route-map FROM-UPSTREAM rule 1000 action &apos;deny&apos;

# === Neighbor Configuration ===
set protocols bgp neighbor 198.51.100.1 remote-as &apos;12345&apos;
set protocols bgp neighbor 198.51.100.1 description &apos;Primary Upstream&apos;
set protocols bgp neighbor 198.51.100.1 update-source &apos;203.0.113.1&apos;
set protocols bgp neighbor 198.51.100.1 address-family ipv4-unicast route-map import &apos;FROM-UPSTREAM&apos;
set protocols bgp neighbor 198.51.100.1 address-family ipv4-unicast route-map export &apos;TO-UPSTREAM&apos;
set protocols bgp neighbor 198.51.100.1 address-family ipv4-unicast maximum-prefix &apos;1000000&apos;
set protocols bgp neighbor 198.51.100.1 address-family ipv4-unicast soft-reconfiguration inbound

# === Networks to Advertise ===
set protocols bgp address-family ipv4-unicast network 203.0.113.0/24
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;BGP without filters is an incident waiting to happen. Every route leak, every hijack, every accidental full-table advertisement — they all trace back to missing or inadequate filtering.&lt;/p&gt;
&lt;p&gt;The essentials:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Prefix lists&lt;/strong&gt;: Define exactly what you advertise and accept&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Route maps&lt;/strong&gt;: Apply policy consistently&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Max-prefix&lt;/strong&gt;: Protect against route leaks from peers&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bogon filtering&lt;/strong&gt;: Never accept or announce garbage prefixes&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Community discipline&lt;/strong&gt;: Use communities for consistent policy&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The internet runs on trust and filtering. BGP peers trust that you advertise only what you should. Filters ensure that even when mistakes happen, damage is limited.&lt;/p&gt;
&lt;p&gt;Before any BGP session goes live, ask: &quot;What&apos;s the worst that happens if I misconfigure this?&quot; Then add filters to prevent that worst case.&lt;/p&gt;
</content:encoded><category>vyos</category><category>bgp</category><category>networking</category><category>routing</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>OSPF on VyOS: When Details Break Everything</title><link>https://ashimov.com/posts/vyos-ospf/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-ospf/</guid><description>Practical OSPF configuration on VyOS. Covers areas, passive interfaces, authentication, MTU issues, and the small details that cause OSPF adjacencies to fail silently.</description><pubDate>Fri, 31 Oct 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;OSPF is deceptively simple to configure. Two routers, same area, same subnet — they should just work. And then they don&apos;t. The adjacency sticks at EXSTART, or neighbors appear and disappear, or routes mysteriously vanish.&lt;/p&gt;
&lt;p&gt;The problem is always in the details. OSPF has strict requirements that must match between neighbors: MTU, hello/dead timers, area type, authentication. Miss one, and the adjacency fails — often silently.&lt;/p&gt;
&lt;h2&gt;OSPF Fundamentals&lt;/h2&gt;
&lt;p&gt;OSPF (Open Shortest Path First) is a link-state protocol. Each router maintains a complete map of the network topology and calculates shortest paths independently.&lt;/p&gt;
&lt;p&gt;Key concepts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Area&lt;/strong&gt;: Logical grouping of routers. Area 0 is the backbone — all other areas must connect to it&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Router ID&lt;/strong&gt;: Unique identifier, usually an IP address&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Adjacency&lt;/strong&gt;: Full neighbor relationship where routers exchange LSAs&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LSA&lt;/strong&gt;: Link State Advertisement — the building blocks of the topology database&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Basic OSPF Configuration&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;configure

# Set router ID (use a loopback IP if you have one)
set protocols ospf parameters router-id &apos;10.255.0.1&apos;

# Enable OSPF on interfaces
set protocols ospf area 0 network &apos;10.0.0.0/24&apos;
set protocols ospf area 0 network &apos;10.0.1.0/24&apos;
set protocols ospf area 0 network &apos;10.255.0.1/32&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This enables OSPF on all interfaces matching those networks in area 0.&lt;/p&gt;
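&lt;p&gt;The matching is simple containment: an interface participates in OSPF if its address falls inside any network statement. The stdlib &lt;code&gt;ipaddress&lt;/code&gt; module shows the logic (the interface addresses here are made up):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import ipaddress

network_statements = [&apos;10.0.0.0/24&apos;, &apos;10.0.1.0/24&apos;, &apos;10.255.0.1/32&apos;]
interfaces = {&apos;eth0&apos;: &apos;10.0.0.1&apos;, &apos;eth1&apos;: &apos;10.0.1.1&apos;,
              &apos;eth2&apos;: &apos;192.168.5.1&apos;, &apos;lo&apos;: &apos;10.255.0.1&apos;}

def ospf_enabled(addr, statements):
    ip = ipaddress.ip_address(addr)
    return any(ip in ipaddress.ip_network(s) for s in statements)

for name, addr in interfaces.items():
    print(name, ospf_enabled(addr, network_statements))
# eth2 stays out of OSPF: no network statement covers 192.168.5.1
&lt;/code&gt;&lt;/pre&gt;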
&lt;h3&gt;Interface-Based Configuration&lt;/h3&gt;
&lt;p&gt;More explicit approach — configure OSPF per interface:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

set protocols ospf parameters router-id &apos;10.255.0.1&apos;

# Enable on specific interfaces
set protocols ospf interface eth0 area &apos;0&apos;
set protocols ospf interface eth1 area &apos;0&apos;
set protocols ospf interface lo area &apos;0&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Interface-based is clearer and preferred for complex setups.&lt;/p&gt;
&lt;h2&gt;Passive Interfaces: The Silent Killer&lt;/h2&gt;
&lt;p&gt;Passive interfaces don&apos;t send or receive OSPF hello packets. Use them on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;LAN segments with no OSPF neighbors&lt;/li&gt;
&lt;li&gt;Internet-facing interfaces&lt;/li&gt;
&lt;li&gt;Management networks&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# Option 1: mark individual interfaces passive
set protocols ospf passive-interface &apos;eth2&apos;

# Option 2: make every interface passive by default...
set protocols ospf passive-interface &apos;default&apos;

# ...then explicitly un-passive the interfaces with OSPF neighbors
set protocols ospf passive-interface-exclude &apos;eth0&apos;
set protocols ospf passive-interface-exclude &apos;eth1&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;The trap&lt;/strong&gt;: Forgetting to exclude an interface means no neighbors form. OSPF just sits there, advertising the network but never receiving hellos. No errors, no warnings — just silence.&lt;/p&gt;
&lt;h3&gt;Debugging Passive Issues&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;show ip ospf neighbor

# Empty? Check if interface is passive
show ip ospf interface eth0
# Look for &quot;Passive interface&quot; in output
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;MTU Mismatch: The Classic OSPF Failure&lt;/h2&gt;
&lt;p&gt;OSPF includes MTU in Database Description packets. If MTU doesn&apos;t match between neighbors, adjacency sticks at EXSTART/EXCHANGE state.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check current MTU
show interfaces ethernet eth0

# Symptoms of MTU mismatch
show ip ospf neighbor
# Neighbor stuck in EXSTART or EXCHANGE state
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Fixing MTU Issues&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Option 1&lt;/strong&gt;: Match MTU on both sides (preferred)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set interfaces ethernet eth0 mtu &apos;1500&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Option 2&lt;/strong&gt;: Ignore MTU check (workaround)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set protocols ospf interface eth0 mtu-ignore
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Use &lt;code&gt;mtu-ignore&lt;/code&gt; only when you can&apos;t control the other side&apos;s MTU. It hides the problem rather than fixing it.&lt;/p&gt;
&lt;h3&gt;Common MTU Scenarios&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Typical MTU&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Standard Ethernet&lt;/td&gt;
&lt;td&gt;1500&lt;/td&gt;
&lt;td&gt;Default&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jumbo frames&lt;/td&gt;
&lt;td&gt;9000&lt;/td&gt;
&lt;td&gt;Must match on all devices in path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GRE tunnel&lt;/td&gt;
&lt;td&gt;1476&lt;/td&gt;
&lt;td&gt;24 bytes overhead&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IPsec tunnel&lt;/td&gt;
&lt;td&gt;1400-1438&lt;/td&gt;
&lt;td&gt;Varies by encryption&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VXLAN&lt;/td&gt;
&lt;td&gt;1450&lt;/td&gt;
&lt;td&gt;50 bytes overhead&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Tunnel interfaces are the usual suspects. Always check MTU when OSPF over tunnels fails.&lt;/p&gt;
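&lt;p&gt;The table values follow from simple header arithmetic. A sketch (overheads are the typical values; IPsec varies with cipher and mode, so it is omitted):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Encapsulation overhead in bytes
OVERHEAD = {
    &apos;gre&apos;: 24,     # 20-byte outer IPv4 header + 4-byte GRE header
    &apos;vxlan&apos;: 50,   # 14 Ethernet + 20 IPv4 + 8 UDP + 8 VXLAN
}

def tunnel_mtu(underlay_mtu, encap):
    &quot;&quot;&quot;Largest packet the tunnel carries without fragmenting the underlay.&quot;&quot;&quot;
    return underlay_mtu - OVERHEAD[encap]

print(tunnel_mtu(1500, &apos;gre&apos;))    # 1476
print(tunnel_mtu(1500, &apos;vxlan&apos;))  # 1450
&lt;/code&gt;&lt;/pre&gt;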
&lt;h2&gt;Hello and Dead Timers&lt;/h2&gt;
&lt;p&gt;OSPF sends hello packets at regular intervals. Miss too many, and the neighbor is declared dead.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Hello interval&lt;/strong&gt;: How often to send hellos (default: 10 seconds)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dead interval&lt;/strong&gt;: How long to wait before declaring neighbor dead (default: 40 seconds)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;These must match between neighbors.&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check current timers
show ip ospf interface eth0

# Modify timers (both sides must match)
set protocols ospf interface eth0 hello-interval &apos;10&apos;
set protocols ospf interface eth0 dead-interval &apos;40&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Fast Failure Detection&lt;/h3&gt;
&lt;p&gt;For faster convergence, reduce timers:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Aggressive timers (1 second hello, 4 second dead)
set protocols ospf interface eth0 hello-interval &apos;1&apos;
set protocols ospf interface eth0 dead-interval &apos;4&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Trade-off: faster detection, but more CPU and more sensitivity to packet loss. With a 4-second dead interval, a brief burst of loss that eats four consecutive hellos tears the adjacency down.&lt;/p&gt;
&lt;h3&gt;BFD for Sub-Second Failover&lt;/h3&gt;
&lt;p&gt;For true fast failover, use BFD (Bidirectional Forwarding Detection) instead of aggressive OSPF timers:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Enable BFD on interface
set protocols ospf interface eth0 bfd

# Configure BFD parameters
set protocols bfd peer 10.0.0.2 source address &apos;10.0.0.1&apos;
set protocols bfd peer 10.0.0.2 interval transmit &apos;300&apos;
set protocols bfd peer 10.0.0.2 interval receive &apos;300&apos;
set protocols bfd peer 10.0.0.2 interval multiplier &apos;3&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With these intervals, BFD declares the peer down after 900 ms (300 ms receive interval times a multiplier of 3), without the CPU overhead of fast OSPF hellos.&lt;/p&gt;
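&lt;p&gt;Detection time is just the negotiated receive interval times the multiplier. In Python:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def bfd_detection_ms(receive_interval_ms, multiplier):
    # The session is declared down after multiplier consecutive
    # missed packets at the negotiated receive interval
    return receive_interval_ms * multiplier

print(bfd_detection_ms(300, 3))  # 900 ms with the configuration above
&lt;/code&gt;&lt;/pre&gt;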
&lt;h2&gt;OSPF Areas&lt;/h2&gt;
&lt;p&gt;Large OSPF networks need multiple areas to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Reduce SPF calculations (changes in one area don&apos;t affect others)&lt;/li&gt;
&lt;li&gt;Limit LSA flooding&lt;/li&gt;
&lt;li&gt;Summarize routes at area boundaries&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Multi-Area Setup&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Backbone area (always area 0)
set protocols ospf interface eth0 area &apos;0&apos;

# Other areas connect through ABR (Area Border Router)
set protocols ospf interface eth1 area &apos;1&apos;
set protocols ospf interface eth2 area &apos;2&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The router with interfaces in multiple areas is an ABR (Area Border Router).&lt;/p&gt;
&lt;h3&gt;Stub Areas&lt;/h3&gt;
&lt;p&gt;Stub areas don&apos;t receive external routes (Type 5 LSAs). Useful for areas that only need a default route to the rest of the network:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Configure area as stub
set protocols ospf area 1 area-type stub

# On ABR, optionally set default route cost
set protocols ospf area 1 area-type stub default-cost &apos;10&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;All routers in the area must agree on stub configuration.&lt;/p&gt;
&lt;h3&gt;Totally Stubby Areas&lt;/h3&gt;
&lt;p&gt;Block both external routes AND inter-area routes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On ABR only
set protocols ospf area 1 area-type stub no-summary
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Routers in the area only see a default route. Simplest routing table, least flexibility.&lt;/p&gt;
&lt;h3&gt;NSSA (Not-So-Stubby Area)&lt;/h3&gt;
&lt;p&gt;Like stub, but allows local external routes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set protocols ospf area 1 area-type nssa
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Useful when the area has an ASBR (redistributing from another protocol) but you don&apos;t want external routes from other areas.&lt;/p&gt;
&lt;h2&gt;OSPF Authentication&lt;/h2&gt;
&lt;h3&gt;MD5 Authentication (Recommended)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Set authentication for interface
set protocols ospf interface eth0 authentication md5 key-id 1 md5-key &apos;YourSecretKey123&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Both neighbors must have identical key-id and key.&lt;/p&gt;
&lt;h3&gt;Rotating Keys&lt;/h3&gt;
&lt;p&gt;OSPF supports multiple keys for hitless rotation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Add new key
set protocols ospf interface eth0 authentication md5 key-id 2 md5-key &apos;NewSecretKey456&apos;

# Both keys active — neighbors using either key will authenticate
# After all neighbors updated, remove old key
delete protocols ospf interface eth0 authentication md5 key-id 1
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Plain Text Authentication (Don&apos;t Use)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Exists but insecure — anyone can sniff the password
set protocols ospf interface eth0 authentication plaintext-password &apos;visible-password&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Use MD5 or no authentication. Plain text is false security.&lt;/p&gt;
&lt;h2&gt;Network Types&lt;/h2&gt;
&lt;p&gt;OSPF behavior changes based on network type:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;DR/BDR&lt;/th&gt;
&lt;th&gt;Multicast&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;broadcast&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Ethernet, default&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;point-to-point&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;P2P links, tunnels&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;point-to-multipoint&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;NBMA with full connectivity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;non-broadcast&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Frame Relay (legacy)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Point-to-Point Links&lt;/h3&gt;
&lt;p&gt;For direct router-to-router links, use point-to-point:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set protocols ospf interface eth0 network &apos;point-to-point&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Benefits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;No DR/BDR election delay&lt;/li&gt;
&lt;li&gt;Faster adjacency formation&lt;/li&gt;
&lt;li&gt;Works over unnumbered interfaces&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Use for: GRE tunnels, VTI interfaces, WireGuard tunnels, direct fiber links.&lt;/p&gt;
&lt;h2&gt;Route Redistribution&lt;/h2&gt;
&lt;p&gt;Import routes from other sources into OSPF:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Redistribute connected routes
set protocols ospf redistribute connected

# Redistribute static routes
set protocols ospf redistribute static

# Redistribute with metric
set protocols ospf redistribute connected metric &apos;100&apos;
set protocols ospf redistribute connected metric-type &apos;2&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Metric types&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Type 1 (E1): External metric added to internal path cost&lt;/li&gt;
&lt;li&gt;Type 2 (E2): External metric only, internal cost ignored (default)&lt;/li&gt;
&lt;/ul&gt;
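&lt;p&gt;The difference matters when routers sit at different distances from the ASBR. A small sketch of the cost calculation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def external_route_cost(metric_type, external_metric, cost_to_asbr):
    &quot;&quot;&quot;Effective cost of an external OSPF route at a given router.&quot;&quot;&quot;
    if metric_type == 1:                        # E1
        return external_metric + cost_to_asbr   # internal path cost is added
    return external_metric                      # E2: external metric only

# Same redistributed route (metric 100), seen from a router 30 cost units
# away from the ASBR:
print(external_route_cost(1, 100, 30))  # 130: nearer ASBRs win
print(external_route_cost(2, 100, 30))  # 100: distance to ASBR ignored
&lt;/code&gt;&lt;/pre&gt;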
&lt;h3&gt;Filtering Redistributed Routes&lt;/h3&gt;
&lt;p&gt;Use route-maps to control what gets redistributed:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Define prefix list
set policy prefix-list OSPF-EXPORT rule 10 action &apos;permit&apos;
set policy prefix-list OSPF-EXPORT rule 10 prefix &apos;10.10.0.0/16&apos;
set policy prefix-list OSPF-EXPORT rule 10 le &apos;24&apos;

# Define route-map
set policy route-map OSPF-REDISTRIBUTE rule 10 action &apos;permit&apos;
set policy route-map OSPF-REDISTRIBUTE rule 10 match ip address prefix-list &apos;OSPF-EXPORT&apos;
set policy route-map OSPF-REDISTRIBUTE rule 10 set metric &apos;50&apos;

# Apply to redistribution
set protocols ospf redistribute connected route-map &apos;OSPF-REDISTRIBUTE&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Troubleshooting OSPF&lt;/h2&gt;
&lt;h3&gt;Check Neighbor State&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;show ip ospf neighbor

# Expected: FULL state for all neighbors
# Problem states:
# - INIT: Receiving hellos, but they don&apos;t see us
# - 2-WAY: Seen each other, waiting for DR election (normal on broadcast)
# - EXSTART/EXCHANGE: Database sync starting (often MTU mismatch)
# - LOADING: Receiving LSAs
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Check Interface Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;show ip ospf interface eth0

# Verify:
# - Correct area
# - Hello/Dead intervals match
# - Not passive when shouldn&apos;t be
# - Network type appropriate
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Check OSPF Database&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show all LSAs
show ip ospf database

# Show specific LSA type
show ip ospf database router
show ip ospf database network
show ip ospf database external
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Check Routes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# OSPF routes
show ip route ospf

# Why isn&apos;t a route showing?
# 1. LSA not received (neighbor issue)
# 2. Better route exists
# 3. Filtering applied
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Common Problems and Solutions&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Likely Cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;No neighbors&lt;/td&gt;
&lt;td&gt;Passive interface, ACL blocking&lt;/td&gt;
&lt;td&gt;Check passive config, firewall rules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stuck at INIT&lt;/td&gt;
&lt;td&gt;One-way communication&lt;/td&gt;
&lt;td&gt;Check firewall, routing back to us&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stuck at EXSTART&lt;/td&gt;
&lt;td&gt;MTU mismatch&lt;/td&gt;
&lt;td&gt;Match MTU or use mtu-ignore&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Neighbors flapping&lt;/td&gt;
&lt;td&gt;Timer mismatch, unstable link&lt;/td&gt;
&lt;td&gt;Match timers, check link quality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Routes missing&lt;/td&gt;
&lt;td&gt;Area mismatch, summarization&lt;/td&gt;
&lt;td&gt;Verify area config, check ABR&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;Complete OSPF Configuration&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;# === OSPF Core ===
set protocols ospf parameters router-id &apos;10.255.0.1&apos;
set protocols ospf log-adjacency-changes

# === Interfaces ===
set protocols ospf interface eth0 area &apos;0&apos;
set protocols ospf interface eth0 network &apos;point-to-point&apos;
set protocols ospf interface eth0 authentication md5 key-id 1 md5-key &apos;SecureKey123&apos;
set protocols ospf interface eth0 bfd

set protocols ospf interface eth1 area &apos;0&apos;
set protocols ospf interface eth1 network &apos;broadcast&apos;
set protocols ospf interface eth1 priority &apos;100&apos;

# === Passive Interfaces ===
set protocols ospf passive-interface &apos;eth2&apos;
set protocols ospf passive-interface &apos;lo&apos;

# === Area Configuration ===
set protocols ospf area 1 area-type stub

# === Redistribution ===
set protocols ospf redistribute connected metric &apos;100&apos;
set protocols ospf redistribute connected route-map &apos;OSPF-EXPORT&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;OSPF fails on details:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;MTU&lt;/strong&gt;: Must match. When adjacency sticks at EXSTART, check MTU first.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Timers&lt;/strong&gt;: Hello and dead intervals must be identical. Mismatched timers = no adjacency.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Passive interfaces&lt;/strong&gt;: A passive interface that should be active produces no errors — just silence.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Authentication&lt;/strong&gt;: Both sides need identical keys and key-ids.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Network type&lt;/strong&gt;: Point-to-point for tunnels and direct links. Broadcast for Ethernet LANs.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The pattern: OSPF is strict about requirements but quiet about failures. When something doesn&apos;t work, methodically check each parameter. The problem is always a mismatch somewhere.&lt;/p&gt;
&lt;p&gt;Debug OSPF by elimination: Can you ping the neighbor? Is the interface passive? Does MTU match? Do timers match? Is authentication correct? Work through the list, and you&apos;ll find it.&lt;/p&gt;
</content:encoded><category>vyos</category><category>networking</category><category>ospf</category><category>routing</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Observability on VyOS: Logs, Metrics, and Backups That Matter</title><link>https://ashimov.com/posts/vyos-observability/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-observability/</guid><description>Setting up proper logging, monitoring, and backup strategies for VyOS. What to log, where to send it, how to back up configurations, and why a router without logs is like production without monitoring.</description><pubDate>Tue, 28 Oct 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Your router is infrastructure. It deserves the same observability as any production system. When something breaks at 3 AM, &quot;I don&apos;t know what happened&quot; isn&apos;t an acceptable answer. Logs, metrics, and configuration backups turn mysterious failures into diagnosable incidents.&lt;/p&gt;
&lt;p&gt;This guide covers practical observability for VyOS — what to capture, where to store it, and how to use it when things go wrong.&lt;/p&gt;
&lt;h2&gt;The Logging Strategy&lt;/h2&gt;
&lt;h3&gt;What to Log&lt;/h3&gt;
&lt;p&gt;Not all logs are equal. High-value logs for troubleshooting:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Log Type&lt;/th&gt;
&lt;th&gt;Why It Matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Firewall drops&lt;/td&gt;
&lt;td&gt;Shows blocked traffic, attack attempts, misconfigurations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Interface state changes&lt;/td&gt;
&lt;td&gt;Link up/down events, carrier changes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BGP/routing changes&lt;/td&gt;
&lt;td&gt;Route flaps, peer state changes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Authentication&lt;/td&gt;
&lt;td&gt;SSH login attempts, successful and failed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Configuration changes&lt;/td&gt;
&lt;td&gt;Who changed what, when&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;System errors&lt;/td&gt;
&lt;td&gt;Kernel messages, service failures&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Basic Logging Setup&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Enable system logging
set system syslog global facility all level &apos;info&apos;
set system syslog global facility protocols level &apos;debug&apos;

# Log to local file
set system syslog file messages facility all level &apos;notice&apos;
set system syslog file auth facility auth level &apos;info&apos;
set system syslog file firewall facility all level &apos;debug&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Remote Logging (Recommended)&lt;/h3&gt;
&lt;p&gt;Local logs disappear when the router dies. Send logs to a remote syslog server:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Remote syslog server
set system syslog host 10.0.0.100 facility all level &apos;info&apos;
set system syslog host 10.0.0.100 protocol &apos;udp&apos;
set system syslog host 10.0.0.100 port &apos;514&apos;

# TCP syslog on 6514 (pair with a TLS-terminating rsyslog relay)
set system syslog host logs.example.com protocol &apos;tcp&apos;
set system syslog host logs.example.com port &apos;6514&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Popular syslog receivers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;rsyslog&lt;/strong&gt;: Standard Linux syslog daemon&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Graylog&lt;/strong&gt;: Full log management platform&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Loki&lt;/strong&gt;: Lightweight, Prometheus-style logs&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Vector&lt;/strong&gt;: Modern log aggregation&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Firewall Logging&lt;/h3&gt;
&lt;p&gt;Firewall logs are crucial. Log all drops, and selectively log accepts:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Log dropped packets
set firewall ipv4 name WAN-TO-LAN default-action &apos;drop&apos;
set firewall ipv4 name WAN-TO-LAN default-log

# Log specific rules
set firewall ipv4 name WAN-TO-LAN rule 100 action &apos;drop&apos;
set firewall ipv4 name WAN-TO-LAN rule 100 log
set firewall ipv4 name WAN-TO-LAN rule 100 description &apos;Log and drop invalid&apos;
set firewall ipv4 name WAN-TO-LAN rule 100 state &apos;invalid&apos;

# Log successful SSH (for audit)
set firewall ipv4 name LAN-LOCAL rule 50 action &apos;accept&apos;
set firewall ipv4 name LAN-LOCAL rule 50 protocol &apos;tcp&apos;
set firewall ipv4 name LAN-LOCAL rule 50 destination port &apos;22&apos;
set firewall ipv4 name LAN-LOCAL rule 50 log
set firewall ipv4 name LAN-LOCAL rule 50 description &apos;Log SSH access&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Warning&lt;/strong&gt;: Don&apos;t log every accept. Logging on high-traffic rules can overwhelm storage and CPU.&lt;/p&gt;
&lt;h3&gt;Reading Logs&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Recent logs
show log

# Filtered logs
show log | match firewall
show log | match -i error

# Tail logs in real-time
monitor log

# Specific log file
show log file firewall
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Metrics and Monitoring&lt;/h2&gt;
&lt;h3&gt;SNMP for Traditional Monitoring&lt;/h3&gt;
&lt;p&gt;If you have Zabbix, PRTG, LibreNMS, or similar:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# SNMP v2c (simple but less secure)
set service snmp community public authorization &apos;ro&apos;
set service snmp community public network &apos;10.0.0.0/24&apos;
set service snmp listen-address 10.0.0.1

# SNMP v3 (recommended)
set service snmp v3 user monitor auth plaintext-password &apos;authpassword&apos;
set service snmp v3 user monitor auth type &apos;sha&apos;
set service snmp v3 user monitor privacy plaintext-password &apos;privpassword&apos;
set service snmp v3 user monitor privacy type &apos;aes&apos;
set service snmp v3 user monitor group &apos;monitor-group&apos;

set service snmp v3 group monitor-group mode &apos;ro&apos;
set service snmp v3 group monitor-group view &apos;monitor-view&apos;
set service snmp v3 view monitor-view oid &apos;.1&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Prometheus/Node Exporter&lt;/h3&gt;
&lt;p&gt;For modern monitoring stacks:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# VyOS doesn&apos;t have native Prometheus exporter, but you can:
# 1. Install node_exporter via container
# 2. Use SNMP exporter with Prometheus
# 3. Script custom metrics export

# Example: expose metrics via simple script
# Create /config/scripts/metrics.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A simple metrics approach:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# /config/scripts/metrics.sh - run via cron or http server

echo &quot;# HELP vyos_interface_rx_bytes Interface received bytes&quot;
echo &quot;# TYPE vyos_interface_rx_bytes counter&quot;
for iface in eth0 eth1 eth2; do
    rx=$(cat /sys/class/net/$iface/statistics/rx_bytes 2&amp;gt;/dev/null || echo 0)
    echo &quot;vyos_interface_rx_bytes{interface=\&quot;$iface\&quot;} $rx&quot;
done

echo &quot;# HELP vyos_firewall_dropped Firewall dropped packets&quot;
echo &quot;# TYPE vyos_firewall_dropped counter&quot;
# Parse drop counters from the firewall ruleset (e.g. nft list ruleset)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Health Checks&lt;/h3&gt;
&lt;p&gt;Monitor critical functions:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Interface status
show interfaces

# Routing table
show ip route

# Firewall counters
show firewall

# VPN status
show vpn ipsec sa
show wireguard peers

# System resources
show system memory
show system cpu
show system storage
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Automate these checks and alert on anomalies.&lt;/p&gt;
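&lt;p&gt;A minimal sketch of such automation; the script path and thresholds are assumptions, and the alert hook is left to whatever mechanism you already use:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# /config/scripts/health-check.sh - example sketch; adjust thresholds

# Compare a numeric metric against warning/critical thresholds
check_threshold() {
    local name=&quot;$1&quot; value=&quot;$2&quot; warn=&quot;$3&quot; crit=&quot;$4&quot;
    if [ &quot;$value&quot; -ge &quot;$crit&quot; ]; then
        echo &quot;CRITICAL $name=$value&quot;
    elif [ &quot;$value&quot; -ge &quot;$warn&quot; ]; then
        echo &quot;WARNING $name=$value&quot;
    else
        echo &quot;OK $name=$value&quot;
    fi
}

# Disk usage percentage of the root filesystem
disk_pct=$(df -P / | awk &apos;NR==2 {gsub(&quot;%&quot;,&quot;&quot;,$5); print $5}&apos;)
check_threshold disk &quot;$disk_pct&quot; 80 95
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Pipe the non-OK lines into your alert script from cron or the task scheduler.&lt;/p&gt;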
&lt;h2&gt;Configuration Backup&lt;/h2&gt;
&lt;h3&gt;Manual Backup&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show configuration (can be piped to file)
show configuration commands

# Save to file
show configuration commands &amp;gt; /config/backup/config-$(date +%Y%m%d).txt

# Compare configuration files
diff /config/backup/config-old.txt /config/backup/config-new.txt
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Automated Backup Script&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# /config/scripts/backup-config.sh

BACKUP_DIR=&quot;/config/backup&quot;
DATE=$(date +%Y%m%d-%H%M)
KEEP_DAYS=30

# Create backup
/opt/vyatta/bin/vyatta-op-cmd-wrapper show configuration commands &amp;gt; &quot;$BACKUP_DIR/vyos-config-$DATE.txt&quot;

# Clean old backups
find &quot;$BACKUP_DIR&quot; -name &quot;vyos-config-*.txt&quot; -mtime +$KEEP_DAYS -delete

# Optional: copy to remote server
# scp &quot;$BACKUP_DIR/vyos-config-$DATE.txt&quot; backup@server:/backups/vyos/
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Schedule it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set system task-scheduler task backup-config executable path &apos;/config/scripts/backup-config.sh&apos;
set system task-scheduler task backup-config interval &apos;1d&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Remote Backup&lt;/h3&gt;
&lt;p&gt;Send backups off-device:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# Backup to remote server via SCP

REMOTE_USER=&quot;backup&quot;
REMOTE_HOST=&quot;10.0.0.100&quot;
REMOTE_PATH=&quot;/backups/vyos&quot;

CONFIG_FILE=&quot;/tmp/vyos-config-$(date +%Y%m%d).txt&quot;

# Generate config
/opt/vyatta/bin/vyatta-op-cmd-wrapper show configuration commands &amp;gt; &quot;$CONFIG_FILE&quot;

# Send to remote
scp -i /config/auth/backup_key &quot;$CONFIG_FILE&quot; &quot;${REMOTE_USER}@${REMOTE_HOST}:${REMOTE_PATH}/&quot;

# Cleanup
rm &quot;$CONFIG_FILE&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Git-Based Configuration Management&lt;/h3&gt;
&lt;p&gt;For infrastructure-as-code approach:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# /config/scripts/git-backup.sh

cd /config
git add -A
git commit -m &quot;Config backup $(date +%Y%m%d-%H%M)&quot;
git push origin main
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Initialize git in /config:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cd /config
git init
git remote add origin git@github.com:yourorg/vyos-config.git
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This gives you:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Full version history&lt;/li&gt;
&lt;li&gt;Diff between any versions&lt;/li&gt;
&lt;li&gt;Blame to see who changed what&lt;/li&gt;
&lt;li&gt;Rollback to any point&lt;/li&gt;
&lt;/ul&gt;
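&lt;p&gt;Once history accumulates, the day-to-day commands look like this (the commit hash is a placeholder):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# What changed in the last backup?
git log --oneline -5
git diff HEAD~1 -- config.boot

# Roll a file back to a known-good commit
git checkout abc1234 -- config.boot
&lt;/code&gt;&lt;/pre&gt;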
&lt;h2&gt;Configuration Diff&lt;/h2&gt;
&lt;p&gt;Always diff before committing changes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Make some changes
set interfaces ethernet eth0 description &apos;NEW-WAN&apos;

# See what would change
compare

# Discard if wrong
discard

# Or commit if correct
commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For historical comparison:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Compare running config with saved boot config (in configure mode)
configure
compare saved
exit

# Compare two backup files
diff /config/backup/config-old.txt /config/backup/config-new.txt
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Alerting&lt;/h2&gt;
&lt;h3&gt;Simple Email Alerts&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# /config/scripts/alert.sh

SUBJECT=&quot;$1&quot;
MESSAGE=&quot;$2&quot;
RECIPIENT=&quot;admin@example.com&quot;

echo &quot;$MESSAGE&quot; | mail -s &quot;$SUBJECT&quot; &quot;$RECIPIENT&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Integrate with monitoring:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# /config/scripts/wan-monitor.sh

if ! ping -c 3 -W 5 8.8.8.8 &amp;gt; /dev/null 2&amp;gt;&amp;amp;1; then
    /config/scripts/alert.sh &quot;WAN DOWN&quot; &quot;Primary WAN unreachable at $(date)&quot;
fi
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Webhook Alerts (Slack, Discord, PagerDuty)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# /config/scripts/webhook-alert.sh

WEBHOOK_URL=&quot;https://hooks.slack.com/services/xxx&quot;
MESSAGE=&quot;$1&quot;

curl -X POST -H &apos;Content-type: application/json&apos; \
    --data &quot;{\&quot;text\&quot;:\&quot;$MESSAGE\&quot;}&quot; \
    &quot;$WEBHOOK_URL&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;What to Monitor&lt;/h2&gt;
&lt;p&gt;Essential metrics for router health:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Warning Threshold&lt;/th&gt;
&lt;th&gt;Critical Threshold&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CPU usage&lt;/td&gt;
&lt;td&gt;&amp;gt; 70%&lt;/td&gt;
&lt;td&gt;&amp;gt; 90%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory usage&lt;/td&gt;
&lt;td&gt;&amp;gt; 80%&lt;/td&gt;
&lt;td&gt;&amp;gt; 95%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Interface errors&lt;/td&gt;
&lt;td&gt;&amp;gt; 0.1%&lt;/td&gt;
&lt;td&gt;&amp;gt; 1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Firewall drops/sec&lt;/td&gt;
&lt;td&gt;Depends on baseline&lt;/td&gt;
&lt;td&gt;Sudden 10x increase&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BGP peer state&lt;/td&gt;
&lt;td&gt;Any change&lt;/td&gt;
&lt;td&gt;Down&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VPN tunnel state&lt;/td&gt;
&lt;td&gt;Flapping&lt;/td&gt;
&lt;td&gt;Down&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disk usage&lt;/td&gt;
&lt;td&gt;&amp;gt; 80%&lt;/td&gt;
&lt;td&gt;&amp;gt; 95%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Config changes&lt;/td&gt;
&lt;td&gt;Any&lt;/td&gt;
&lt;td&gt;Unexpected&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
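&lt;p&gt;The interface error-rate thresholds are a ratio of error to packet counters; a sketch with sample values (on the router, read &lt;code&gt;/sys/class/net/eth0/statistics/rx_errors&lt;/code&gt; and &lt;code&gt;rx_packets&lt;/code&gt; instead):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# Interface error rate as a percentage of received packets (sample values)
rx_packets=2000000
rx_errors=1500
rate=$(awk -v e=&quot;$rx_errors&quot; -v p=&quot;$rx_packets&quot; &apos;BEGIN {printf &quot;%.3f&quot;, e*100/p}&apos;)
echo &quot;rx error rate: ${rate}%&quot;
&lt;/code&gt;&lt;/pre&gt;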
&lt;h2&gt;Disaster Recovery Checklist&lt;/h2&gt;
&lt;p&gt;When everything fails, you need:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Configuration backup&lt;/strong&gt; (tested restore)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Firmware/image backup&lt;/strong&gt; (same VyOS version)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Documented procedure&lt;/strong&gt; (how to restore)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Out-of-band access&lt;/strong&gt; (console, IPMI, if available)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Test Your Backups&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Periodically test restore on a test instance.
# A backup made with &quot;show configuration commands&quot; is a list of set
# commands: replay it with &quot;source&quot; in configure mode (&quot;load&quot; expects
# config.boot format, not commands)
configure
source /config/backup/vyos-config-latest.txt
compare
# Review changes
commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A backup you&apos;ve never tested restoring is not a backup.&lt;/p&gt;
&lt;h2&gt;Complete Observability Setup&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;# === Syslog ===
set system syslog global facility all level &apos;info&apos;
set system syslog host 10.0.0.100 facility all level &apos;info&apos;
set system syslog host 10.0.0.100 protocol &apos;udp&apos;
set system syslog file messages facility all level &apos;notice&apos;

# === SNMP ===
set service snmp community monitoring authorization &apos;ro&apos;
set service snmp community monitoring network &apos;10.0.0.0/24&apos;
set service snmp listen-address 10.0.0.1
set service snmp location &apos;Network Closet&apos;
set service snmp contact &apos;admin@example.com&apos;

# === Firewall Logging ===
set firewall ipv4 name WAN-TO-LAN default-log
set firewall ipv4 name WAN-LOCAL default-log

# === Scheduled Tasks ===
set system task-scheduler task backup-config executable path &apos;/config/scripts/backup-config.sh&apos;
set system task-scheduler task backup-config interval &apos;1d&apos;
set system task-scheduler task wan-monitor executable path &apos;/config/scripts/wan-monitor.sh&apos;
set system task-scheduler task wan-monitor interval &apos;5m&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;A router without observability is like running production without monitoring — you&apos;ll only know something&apos;s wrong when users complain, and you&apos;ll have no data to diagnose it.&lt;/p&gt;
&lt;p&gt;The minimum viable observability:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Remote syslog&lt;/strong&gt;: Logs survive device failure&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Firewall logging&lt;/strong&gt;: See what&apos;s being blocked&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Configuration backups&lt;/strong&gt;: Automated, tested, off-device&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Health monitoring&lt;/strong&gt;: Alert before users notice&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Everything else builds on this foundation. Start simple, add complexity as needed. The goal isn&apos;t comprehensive monitoring — it&apos;s having the data you need when things break.&lt;/p&gt;
</content:encoded><category>vyos</category><category>backup</category><category>monitoring</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>QoS on VyOS: Making Latency Feel Better</title><link>https://ashimov.com/posts/vyos-qos/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-qos/</guid><description>Practical traffic shaping and QoS configuration on VyOS. Covers queue disciplines, traffic prioritization, fighting bufferbloat, and understanding where the actual bottleneck is.</description><pubDate>Fri, 24 Oct 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;QoS (Quality of Service) is often misunderstood. People expect it to &quot;make the internet faster.&quot; It doesn&apos;t. QoS is about managing scarcity — when there&apos;s not enough bandwidth for everyone, QoS decides who gets priority.&lt;/p&gt;
&lt;p&gt;The key insight: &lt;strong&gt;QoS only works when you control the bottleneck&lt;/strong&gt;. If your ISP is the bottleneck, traffic shaping on your router shapes what leaves your network, not what your ISP does. Understanding this is crucial for effective QoS.&lt;/p&gt;
&lt;h2&gt;Understanding the Problem: Bufferbloat&lt;/h2&gt;
&lt;p&gt;Modern networks have a hidden enemy: bufferbloat. Network devices have large buffers that queue packets when congested. Large buffers = high latency during congestion.&lt;/p&gt;
&lt;p&gt;Scenario: You&apos;re on a video call. Someone starts a large download. Suddenly your call has 500ms latency because packets are stuck in buffers behind download packets.&lt;/p&gt;
&lt;p&gt;QoS solves this by:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Shaping traffic below the actual link speed (to control where queuing happens)&lt;/li&gt;
&lt;li&gt;Prioritizing latency-sensitive traffic&lt;/li&gt;
&lt;li&gt;Using smart queue disciplines that prevent buffer buildup&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Measuring the Problem&lt;/h2&gt;
&lt;p&gt;Before configuring QoS, measure your baseline:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Test bufferbloat (from client, while running a speed test)
ping 8.8.8.8

# Watch for latency increase during upload/download
# Normal: ~20ms, Bufferbloat: 200-1000ms
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or use the &lt;a href=&quot;https://www.waveform.com/tools/bufferbloat&quot;&gt;Bufferbloat test&lt;/a&gt; — it specifically measures latency under load.&lt;/p&gt;
&lt;h2&gt;VyOS Traffic Shaping Basics&lt;/h2&gt;
&lt;p&gt;VyOS uses Linux tc (traffic control) under the hood. Two main components:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Shaper&lt;/strong&gt;: Limits overall bandwidth
&lt;strong&gt;Classes/Queues&lt;/strong&gt;: Divide bandwidth among traffic types&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Basic traffic shaping on WAN interface
set traffic-policy shaper WAN-OUT bandwidth &apos;95mbit&apos;
set traffic-policy shaper WAN-OUT default bandwidth &apos;50%&apos;
set traffic-policy shaper WAN-OUT default ceiling &apos;100%&apos;
set traffic-policy shaper WAN-OUT default queue-type &apos;fq-codel&apos;

# Apply to outbound on WAN interface
set interfaces ethernet eth0 traffic-policy out &apos;WAN-OUT&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Key points:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;bandwidth &apos;95mbit&apos;&lt;/strong&gt;: Total shaper bandwidth. Set this ~95% of your actual upload speed&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;fq-codel&lt;/strong&gt;: Fair Queue with Controlled Delay — fights bufferbloat&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ceiling &apos;100%&apos;&lt;/strong&gt;: Can burst to full bandwidth if available&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Why 95% of Your Actual Speed?&lt;/h2&gt;
&lt;p&gt;If your upload is 100Mbps and you shape at 100Mbps, congestion still happens at your ISP&apos;s edge. You&apos;re not controlling the bottleneck.&lt;/p&gt;
&lt;p&gt;Shape at 95Mbps (or even 90Mbps for very stable latency), and congestion happens at your router, where you control the queue. Your router&apos;s smart queue (fq-codel) manages latency instead of your ISP&apos;s dumb FIFO buffer.&lt;/p&gt;
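&lt;p&gt;The rule is easy to script: derive the shaper bandwidth from a measured upload speed (the measured figure below is an example).&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# Derive a shaper rate as 95% of the measured upload (example figure)
measured_kbit=102400   # 100 Mbit/s measured upload, in kbit/s
shape_kbit=$(( measured_kbit * 95 / 100 ))
echo &quot;set traffic-policy shaper WAN-OUT bandwidth &apos;${shape_kbit}kbit&apos;&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Re-measure after your ISP plan changes; a stale shaper rate either wastes bandwidth or stops controlling the bottleneck.&lt;/p&gt;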
&lt;h2&gt;Traffic Classes: Prioritizing Different Traffic&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;configure

# Create shaper with classes
set traffic-policy shaper WAN-OUT bandwidth &apos;95mbit&apos;

# Voice/Video - highest priority, guaranteed bandwidth
set traffic-policy shaper WAN-OUT class 10 bandwidth &apos;20%&apos;
set traffic-policy shaper WAN-OUT class 10 ceiling &apos;50%&apos;
set traffic-policy shaper WAN-OUT class 10 priority &apos;0&apos;
set traffic-policy shaper WAN-OUT class 10 queue-type &apos;fq-codel&apos;
set traffic-policy shaper WAN-OUT class 10 match VOIP ip dscp &apos;ef&apos;

# Interactive (SSH, gaming) - high priority
set traffic-policy shaper WAN-OUT class 20 bandwidth &apos;10%&apos;
set traffic-policy shaper WAN-OUT class 20 ceiling &apos;100%&apos;
set traffic-policy shaper WAN-OUT class 20 priority &apos;1&apos;
set traffic-policy shaper WAN-OUT class 20 queue-type &apos;fq-codel&apos;
set traffic-policy shaper WAN-OUT class 20 match SSH ip protocol &apos;tcp&apos;
set traffic-policy shaper WAN-OUT class 20 match SSH ip destination port &apos;22&apos;

# Web browsing - normal priority
set traffic-policy shaper WAN-OUT class 30 bandwidth &apos;30%&apos;
set traffic-policy shaper WAN-OUT class 30 ceiling &apos;100%&apos;
set traffic-policy shaper WAN-OUT class 30 priority &apos;3&apos;
set traffic-policy shaper WAN-OUT class 30 queue-type &apos;fq-codel&apos;
set traffic-policy shaper WAN-OUT class 30 match HTTP ip protocol &apos;tcp&apos;
set traffic-policy shaper WAN-OUT class 30 match HTTP ip destination port &apos;80,443&apos;

# Bulk downloads - lowest priority
set traffic-policy shaper WAN-OUT class 40 bandwidth &apos;20%&apos;
set traffic-policy shaper WAN-OUT class 40 ceiling &apos;90%&apos;
set traffic-policy shaper WAN-OUT class 40 priority &apos;5&apos;
set traffic-policy shaper WAN-OUT class 40 queue-type &apos;fq-codel&apos;

# Default for unclassified traffic
set traffic-policy shaper WAN-OUT default bandwidth &apos;20%&apos;
set traffic-policy shaper WAN-OUT default ceiling &apos;100%&apos;
set traffic-policy shaper WAN-OUT default priority &apos;4&apos;
set traffic-policy shaper WAN-OUT default queue-type &apos;fq-codel&apos;

set interfaces ethernet eth0 traffic-policy out &apos;WAN-OUT&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Understanding the Parameters&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;bandwidth&lt;/strong&gt;: Guaranteed minimum bandwidth for this class
&lt;strong&gt;ceiling&lt;/strong&gt;: Maximum bandwidth when other classes aren&apos;t using theirs
&lt;strong&gt;priority&lt;/strong&gt;: Lower number = higher priority (0 is highest)
&lt;strong&gt;queue-type&lt;/strong&gt;: Algorithm for managing the queue&lt;/p&gt;
&lt;p&gt;Bandwidth percentages should roughly add up to 100%. The ceiling allows classes to borrow unused bandwidth.&lt;/p&gt;
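&lt;p&gt;A quick sanity check can sum the guarantees from a saved command dump; the inline sample mirrors the classes above, and in practice you would feed it &lt;code&gt;show configuration commands | match WAN-OUT&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# Sum guaranteed bandwidth percentages for one shaper (sample data inline)
config=&quot;set traffic-policy shaper WAN-OUT class 10 bandwidth &apos;20%&apos;
set traffic-policy shaper WAN-OUT class 20 bandwidth &apos;10%&apos;
set traffic-policy shaper WAN-OUT class 30 bandwidth &apos;30%&apos;
set traffic-policy shaper WAN-OUT class 40 bandwidth &apos;20%&apos;
set traffic-policy shaper WAN-OUT default bandwidth &apos;20%&apos;&quot;

total=$(echo &quot;$config&quot; | grep -o &quot;[0-9]*%&quot; | tr -d &apos;%&apos; | awk &apos;{s+=$1} END {print s}&apos;)
echo &quot;guaranteed total: ${total}%&quot;
&lt;/code&gt;&lt;/pre&gt;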
&lt;h2&gt;Queue Types: fq-codel vs Others&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;fq-codel (Fair Queue Controlled Delay)&lt;/strong&gt;: Best for most cases. Maintains low latency, fair sharing between flows. Use this unless you have specific needs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;sfq (Stochastic Fair Queue)&lt;/strong&gt;: Simpler, less effective at latency control. Legacy option.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;pfifo/bfifo&lt;/strong&gt;: Simple FIFO queues. Don&apos;t fight bufferbloat. Avoid.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;cake&lt;/strong&gt;: Advanced shaper (may need additional packages). Even better than fq-codel for some scenarios.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# If cake is available
set traffic-policy shaper WAN-OUT default queue-type &apos;cake&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Inbound Shaping: The Hard Problem&lt;/h2&gt;
&lt;p&gt;You can&apos;t directly control inbound traffic — it&apos;s already at your doorstep. But you can:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Police incoming traffic&lt;/strong&gt;: Drop/mark packets exceeding rate&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use ingress shaping&lt;/strong&gt;: Shape traffic after it arrives&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rely on TCP feedback&lt;/strong&gt;: Shaping outbound ACKs affects inbound TCP rate&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;configure

# Ingress policing on WAN interface
set traffic-policy limiter WAN-IN class 10 bandwidth &apos;95mbit&apos;
set traffic-policy limiter WAN-IN class 10 match ALL ip source address &apos;0.0.0.0/0&apos;
set traffic-policy limiter WAN-IN default bandwidth &apos;95mbit&apos;

set interfaces ethernet eth0 traffic-policy in &apos;WAN-IN&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is less precise than outbound shaping. For better download QoS, shape slightly below your download speed and let fq-codel manage queuing.&lt;/p&gt;
&lt;h2&gt;Practical Examples&lt;/h2&gt;
&lt;h3&gt;Home Office: Prioritize Video Calls&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Identify video call traffic (Zoom, Teams, etc use UDP on various ports)
set traffic-policy shaper WAN-OUT class 10 match VIDEO-UDP ip protocol &apos;udp&apos;
set traffic-policy shaper WAN-OUT class 10 match VIDEO-UDP ip destination port &apos;3478-3481,8801-8810,19302-19309&apos;

# Give video 30% guaranteed, can burst to 60%
set traffic-policy shaper WAN-OUT class 10 bandwidth &apos;30%&apos;
set traffic-policy shaper WAN-OUT class 10 ceiling &apos;60%&apos;
set traffic-policy shaper WAN-OUT class 10 priority &apos;0&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Gaming: Low Latency&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Gaming often uses specific ports (varies by game)
set traffic-policy shaper WAN-OUT class 15 match GAMING ip protocol &apos;udp&apos;
set traffic-policy shaper WAN-OUT class 15 match GAMING ip source port &apos;1024-65535&apos;
set traffic-policy shaper WAN-OUT class 15 bandwidth &apos;15%&apos;
set traffic-policy shaper WAN-OUT class 15 ceiling &apos;50%&apos;
set traffic-policy shaper WAN-OUT class 15 priority &apos;0&apos;

# Also prioritize small packets (often game state updates)
set traffic-policy shaper WAN-OUT class 15 match SMALL ip max-length &apos;256&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Torrent/Backup Deprioritization&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Bulk traffic class - low priority
set traffic-policy shaper WAN-OUT class 50 bandwidth &apos;10%&apos;
set traffic-policy shaper WAN-OUT class 50 ceiling &apos;80%&apos;
set traffic-policy shaper WAN-OUT class 50 priority &apos;7&apos;

# Match by ports commonly used by bulk transfers
set traffic-policy shaper WAN-OUT class 50 match TORRENT ip destination port &apos;6881-6889&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;DSCP Marking&lt;/h2&gt;
&lt;p&gt;DSCP (Differentiated Services Code Point) is a field in the IP header used to classify traffic. Many applications set DSCP themselves; you can match on it for classification:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Match on DSCP values set by applications
set traffic-policy shaper WAN-OUT class 10 match VOICE ip dscp &apos;ef&apos;
set traffic-policy shaper WAN-OUT class 20 match VIDEO ip dscp &apos;af41&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Common DSCP values:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;EF (46)&lt;/strong&gt;: Expedited Forwarding - voice&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AF41 (34)&lt;/strong&gt;: Assured Forwarding - video&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AF21 (18)&lt;/strong&gt;: Assured Forwarding - business critical&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CS1 (8)&lt;/strong&gt;: Scavenger - bulk/background&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can also mark traffic yourself:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Mark VoIP traffic with EF (DSCP 46)
set firewall ipv4 name MARK-QOS rule 10 action &apos;accept&apos;
set firewall ipv4 name MARK-QOS rule 10 protocol &apos;udp&apos;
set firewall ipv4 name MARK-QOS rule 10 destination port &apos;5060&apos;
set firewall ipv4 name MARK-QOS rule 10 set dscp &apos;46&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Monitoring QoS&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;# Show current traffic policy statistics
show queueing ethernet eth0

# Show class statistics
tc -s class show dev eth0

# Watch queue lengths
watch tc -s qdisc show dev eth0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Look for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;drops&lt;/strong&gt;: Some drops are normal (fq-codel drops to signal congestion)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;backlog&lt;/strong&gt;: Should stay low; a growing backlog means buffers are building up&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;overlimits&lt;/strong&gt;: Traffic exceeding class bandwidth (borrowing from ceiling)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Debugging QoS Issues&lt;/h2&gt;
&lt;h3&gt;Traffic Not Being Classified&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Most traffic in default class? Check your matches
show queueing ethernet eth0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If priority traffic isn&apos;t getting classified, verify:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Port/protocol matches are correct&lt;/li&gt;
&lt;li&gt;Traffic isn&apos;t using unexpected ports (HTTPS multiplexes everything over 443)&lt;/li&gt;
&lt;li&gt;Match rules are specific enough&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Latency Still High&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Shaper bandwidth too high&lt;/strong&gt;: Lower it (try 90% of link speed)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Not using fq-codel&lt;/strong&gt;: Change queue-type&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Inbound is the problem&lt;/strong&gt;: Need ingress shaping too&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ISP QoS&lt;/strong&gt;: Your ISP might have their own queuing&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;VoIP Quality Still Poor&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Jitter buffer&lt;/strong&gt;: Some jitter is handled by endpoints&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Packet loss&lt;/strong&gt;: Check &lt;code&gt;show queueing&lt;/code&gt; for excessive drops&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Misclassified traffic&lt;/strong&gt;: Verify VoIP is hitting the right class&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Complete QoS Configuration&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;# === Traffic Policy ===
set traffic-policy shaper WAN-OUT bandwidth &apos;95mbit&apos;

# Voice/Video - highest priority
set traffic-policy shaper WAN-OUT class 10 bandwidth &apos;20%&apos;
set traffic-policy shaper WAN-OUT class 10 ceiling &apos;50%&apos;
set traffic-policy shaper WAN-OUT class 10 priority &apos;0&apos;
set traffic-policy shaper WAN-OUT class 10 queue-type &apos;fq-codel&apos;
set traffic-policy shaper WAN-OUT class 10 match VOIP ip dscp &apos;ef&apos;
set traffic-policy shaper WAN-OUT class 10 match REALTIME ip protocol &apos;udp&apos;
set traffic-policy shaper WAN-OUT class 10 match REALTIME ip destination port &apos;3478-3481,5060,16384-32767&apos;

# Interactive
set traffic-policy shaper WAN-OUT class 20 bandwidth &apos;15%&apos;
set traffic-policy shaper WAN-OUT class 20 ceiling &apos;100%&apos;
set traffic-policy shaper WAN-OUT class 20 priority &apos;1&apos;
set traffic-policy shaper WAN-OUT class 20 queue-type &apos;fq-codel&apos;
set traffic-policy shaper WAN-OUT class 20 match SSH ip protocol &apos;tcp&apos;
set traffic-policy shaper WAN-OUT class 20 match SSH ip destination port &apos;22&apos;
set traffic-policy shaper WAN-OUT class 20 match DNS ip protocol &apos;udp&apos;
set traffic-policy shaper WAN-OUT class 20 match DNS ip destination port &apos;53&apos;

# Web
set traffic-policy shaper WAN-OUT class 30 bandwidth &apos;35%&apos;
set traffic-policy shaper WAN-OUT class 30 ceiling &apos;100%&apos;
set traffic-policy shaper WAN-OUT class 30 priority &apos;3&apos;
set traffic-policy shaper WAN-OUT class 30 queue-type &apos;fq-codel&apos;
set traffic-policy shaper WAN-OUT class 30 match WEB ip protocol &apos;tcp&apos;
set traffic-policy shaper WAN-OUT class 30 match WEB ip destination port &apos;80,443&apos;

# Bulk
set traffic-policy shaper WAN-OUT class 40 bandwidth &apos;10%&apos;
set traffic-policy shaper WAN-OUT class 40 ceiling &apos;80%&apos;
set traffic-policy shaper WAN-OUT class 40 priority &apos;6&apos;
set traffic-policy shaper WAN-OUT class 40 queue-type &apos;fq-codel&apos;

# Default
set traffic-policy shaper WAN-OUT default bandwidth &apos;20%&apos;
set traffic-policy shaper WAN-OUT default ceiling &apos;100%&apos;
set traffic-policy shaper WAN-OUT default priority &apos;4&apos;
set traffic-policy shaper WAN-OUT default queue-type &apos;fq-codel&apos;

# === Apply ===
set interfaces ethernet eth0 traffic-policy out &apos;WAN-OUT&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;QoS works when you understand the bottleneck:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Shape below link speed&lt;/strong&gt;: This moves the bottleneck to your router where you control queuing&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use smart queues (fq-codel)&lt;/strong&gt;: They maintain low latency automatically&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Prioritize appropriately&lt;/strong&gt;: Not everything can be high priority — that&apos;s just no priority&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The goal isn&apos;t faster internet — it&apos;s &lt;strong&gt;predictable&lt;/strong&gt; internet. Video calls that don&apos;t stutter when someone starts a download. SSH that stays responsive during backups. Gaming that doesn&apos;t spike during updates.&lt;/p&gt;
&lt;p&gt;Test before and after. Measure latency under load. If it doesn&apos;t improve, you haven&apos;t identified the real bottleneck yet.&lt;/p&gt;
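&lt;p&gt;For the before/after comparison, the average RTT can be pulled straight out of the ping summary; the summary line below is a sample, and in practice you would pipe &lt;code&gt;ping -c 20 8.8.8.8 | tail -n1&lt;/code&gt; into the same awk:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# Extract avg RTT from a ping summary line (sample shown for illustration)
summary=&quot;rtt min/avg/max/mdev = 18.1/22.4/35.9/4.2 ms&quot;
avg=$(echo &quot;$summary&quot; | awk -F&apos;/&apos; &apos;{print $5}&apos;)
echo &quot;avg RTT: ${avg} ms&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run it idle, then again under full upload; the delta is your bufferbloat.&lt;/p&gt;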
</content:encoded><category>vyos</category><category>networking</category><category>troubleshooting</category><category>tuning</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Multi-WAN on VyOS: Failover That Actually Works</title><link>https://ashimov.com/posts/vyos-multi-wan/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-multi-wan/</guid><description>Configuring reliable multi-WAN failover on VyOS with proper health checking. Covers dual ISP setup, weighted load balancing, SLA monitoring, and why failover without tracking is false confidence.</description><pubDate>Tue, 21 Oct 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Having two internet connections means nothing if your router doesn&apos;t know when one fails. I&apos;ve seen setups where the &quot;failover&quot; just meant two default routes with different metrics — the primary could be completely dead, and the router would happily keep trying to send traffic through it until the metrics were manually adjusted.&lt;/p&gt;
&lt;p&gt;Real failover requires active health checking. VyOS provides this, but it needs proper configuration. Let&apos;s build multi-WAN that actually works.&lt;/p&gt;
&lt;h2&gt;The Multi-WAN Architecture&lt;/h2&gt;
&lt;p&gt;Typical setup:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;eth0&lt;/strong&gt;: Primary ISP (faster, preferred)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;eth1&lt;/strong&gt;: Secondary ISP (backup)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;eth2&lt;/strong&gt;: LAN&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Goals:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Use primary when healthy&lt;/li&gt;
&lt;li&gt;Failover to secondary when primary fails&lt;/li&gt;
&lt;li&gt;Fail back when primary recovers&lt;/li&gt;
&lt;li&gt;All of this automatically&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Basic Interface Setup&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;configure

# Primary WAN
set interfaces ethernet eth0 description &apos;WAN-PRIMARY&apos;
set interfaces ethernet eth0 address dhcp

# Secondary WAN
set interfaces ethernet eth1 description &apos;WAN-SECONDARY&apos;
set interfaces ethernet eth1 address dhcp

# LAN
set interfaces ethernet eth2 description &apos;LAN&apos;
set interfaces ethernet eth2 address &apos;10.0.0.1/24&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Wrong Way: Static Metrics&lt;/h2&gt;
&lt;p&gt;You might think this works:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# DON&apos;T DO THIS (or at least, don&apos;t rely only on this)
set protocols static route 0.0.0.0/0 next-hop 192.168.1.1 distance 10
set protocols static route 0.0.0.0/0 next-hop 192.168.2.1 distance 20
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This creates two default routes. The lower distance (10) is preferred. But here&apos;s the problem: if the primary ISP goes down at layer 3 (routing issue, ISP outage, etc.), the interface might still be up. The router keeps using the &quot;preferred&quot; route that goes nowhere.&lt;/p&gt;
&lt;h2&gt;The Right Way: Health Checking&lt;/h2&gt;
&lt;p&gt;VyOS gives you three ways to do this, from weakest to strongest: static routes with interface tracking, scripted health probes, and the built-in WAN load-balancing subsystem with active health checks. Each is covered below.&lt;/p&gt;
&lt;h3&gt;Option 1: Interface State Tracking&lt;/h3&gt;
&lt;p&gt;Basic tracking — failover when interface goes down:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Primary route with interface tracking
set protocols static route 0.0.0.0/0 next-hop 192.168.1.1 distance 10
set protocols static route 0.0.0.0/0 next-hop 192.168.1.1 interface &apos;eth0&apos;

# Secondary route - used when primary interface is down
set protocols static route 0.0.0.0/0 next-hop 192.168.2.1 distance 20
set protocols static route 0.0.0.0/0 next-hop 192.168.2.1 interface &apos;eth1&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This helps but only detects link failure, not upstream issues.&lt;/p&gt;
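&lt;p&gt;You can see the gap from the CLI: the link can report up/up while the path beyond the gateway is dead, so test end-to-end reachability, not link state:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Link state says nothing about the ISP&apos;s upstream
show interfaces ethernet eth0

# The real test: can this WAN actually reach the internet?
ping 8.8.8.8 interface eth0 count 3
&lt;/code&gt;&lt;/pre&gt;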
&lt;h3&gt;Option 2: Scripted Health Checks&lt;/h3&gt;
&lt;p&gt;For proper SLA monitoring, create a health check script:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# /config/scripts/wan-health-check.sh
# Note: this edits the kernel routing table directly, outside the VyOS
# config system — changes do not persist across reboots and can be
# overwritten by a commit. The built-in load balancing (Option 3) avoids this.

PRIMARY_GW=&quot;192.168.1.1&quot;
SECONDARY_GW=&quot;192.168.2.1&quot;
CHECK_TARGET=&quot;8.8.8.8&quot;
PRIMARY_METRIC=10
FAILOVER_METRIC=5

# Check primary WAN by pinging through it
if ping -c 3 -W 2 -I eth0 $CHECK_TARGET &amp;gt; /dev/null 2&amp;gt;&amp;amp;1; then
    # Primary is healthy - remove stale failover routes, restore preference.
    # (ip route replace matches on prefix+metric, so routes installed
    # with other metrics must be deleted explicitly.)
    ip route del default via $SECONDARY_GW metric $FAILOVER_METRIC 2&amp;gt;/dev/null
    ip route del default via $PRIMARY_GW metric 100 2&amp;gt;/dev/null
    ip route replace default via $PRIMARY_GW metric $PRIMARY_METRIC
    ip route replace default via $SECONDARY_GW metric 20
else
    # Primary is down - make secondary preferred
    ip route replace default via $SECONDARY_GW metric $FAILOVER_METRIC
    ip route replace default via $PRIMARY_GW metric 100
    logger &quot;WAN Failover: Primary down, using secondary&quot;
fi
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Schedule it with the task scheduler (a bare interval value is interpreted as minutes, so &apos;30&apos; would mean every 30 minutes; &apos;1&apos; gives the fastest cron-backed cadence of once per minute):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set system task-scheduler task wan-check executable path &apos;/config/scripts/wan-health-check.sh&apos;
set system task-scheduler task wan-check interval &apos;1&apos;
&lt;/code&gt;&lt;/pre&gt;
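&lt;p&gt;Two practical notes: the script must be executable, and keeping it under &lt;code&gt;/config&lt;/code&gt; means it survives image upgrades. Run it once by hand before trusting the scheduler:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;chmod +x /config/scripts/wan-health-check.sh

# Dry run: execute manually and inspect the routes it installed
sudo /config/scripts/wan-health-check.sh
ip route show default
&lt;/code&gt;&lt;/pre&gt;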
&lt;h3&gt;Option 3: VyOS WAN Load Balancing (Recommended)&lt;/h3&gt;
&lt;p&gt;VyOS has built-in WAN load balancing with health checks:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Define WAN interfaces for load balancing
set load-balancing wan interface-health eth0 failure-count &apos;3&apos;
set load-balancing wan interface-health eth0 nexthop &apos;192.168.1.1&apos;
set load-balancing wan interface-health eth0 success-count &apos;3&apos;
set load-balancing wan interface-health eth0 test 10 resp-time &apos;5&apos;
set load-balancing wan interface-health eth0 test 10 target &apos;8.8.8.8&apos;
set load-balancing wan interface-health eth0 test 10 type &apos;ping&apos;

set load-balancing wan interface-health eth1 failure-count &apos;3&apos;
set load-balancing wan interface-health eth1 nexthop &apos;192.168.2.1&apos;
set load-balancing wan interface-health eth1 success-count &apos;3&apos;
set load-balancing wan interface-health eth1 test 10 resp-time &apos;5&apos;
set load-balancing wan interface-health eth1 test 10 target &apos;8.8.4.4&apos;
set load-balancing wan interface-health eth1 test 10 type &apos;ping&apos;

# Define load balancing rule
set load-balancing wan rule 10 inbound-interface &apos;eth2&apos;
set load-balancing wan rule 10 interface eth0 weight &apos;100&apos;
set load-balancing wan rule 10 interface eth1 weight &apos;1&apos;
set load-balancing wan rule 10 failover

# Sticky connections (optional - keeps sessions on same WAN)
set load-balancing wan sticky-connections inbound
set load-balancing wan enable-local-traffic

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Key parameters:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;failure-count&lt;/strong&gt;: How many failed tests before marking down&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;success-count&lt;/strong&gt;: How many successes before marking up&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;weight&lt;/strong&gt;: Higher = more traffic (100:1 means primary gets almost all traffic)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;failover&lt;/strong&gt;: Enable failover mode (not just load balancing)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Understanding the Health Check&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;set load-balancing wan interface-health eth0 test 10 target &apos;8.8.8.8&apos;
set load-balancing wan interface-health eth0 test 10 type &apos;ping&apos;
set load-balancing wan interface-health eth0 test 10 resp-time &apos;5&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This pings 8.8.8.8 through eth0. If response takes &amp;gt;5 seconds or fails, it counts as a failure. After 3 failures (failure-count), the interface is marked down.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Choose your test target wisely:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Public DNS (8.8.8.8, 1.1.1.1) - highly available&lt;/li&gt;
&lt;li&gt;Your ISP&apos;s gateway - tests only first hop&lt;/li&gt;
&lt;li&gt;Multiple targets for more confidence&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# Multiple tests for more confidence - evaluated in ascending order
set load-balancing wan interface-health eth0 test 10 target &apos;8.8.8.8&apos;
set load-balancing wan interface-health eth0 test 10 type &apos;ping&apos;
set load-balancing wan interface-health eth0 test 20 target &apos;1.1.1.1&apos;
set load-balancing wan interface-health eth0 test 20 type &apos;ping&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;NAT for Multi-WAN&lt;/h2&gt;
&lt;p&gt;Each WAN needs its own NAT rule:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# NAT for primary WAN
set nat source rule 100 outbound-interface name &apos;eth0&apos;
set nat source rule 100 source address &apos;10.0.0.0/24&apos;
set nat source rule 100 translation address &apos;masquerade&apos;

# NAT for secondary WAN
set nat source rule 110 outbound-interface name &apos;eth1&apos;
set nat source rule 110 source address &apos;10.0.0.0/24&apos;
set nat source rule 110 translation address &apos;masquerade&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;masquerade&lt;/code&gt; automatically uses the correct outbound IP based on which interface traffic exits.&lt;/p&gt;
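&lt;p&gt;A quick way to confirm per-WAN NAT is to ask an external echo service which address each path presents (assumes outbound HTTPS is allowed; ifconfig.me is one example of such a service):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Should print each ISP&apos;s public IP respectively
sudo curl --interface eth0 https://ifconfig.me
sudo curl --interface eth1 https://ifconfig.me
&lt;/code&gt;&lt;/pre&gt;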
&lt;h2&gt;Monitoring WAN Status&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;# Check WAN health status
show wan-load-balance

# Check current routing
show ip route

# Check NAT sessions
show nat source translations
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Sticky Sessions: Why They Matter&lt;/h2&gt;
&lt;p&gt;Without sticky sessions, a TCP connection might start on WAN1, then mid-connection failover happens, and packets go out WAN2 with a different source IP. The remote server sees packets from a different IP and drops them.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set load-balancing wan sticky-connections inbound
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Sticky connections track existing connections and keep them on the same WAN until they complete. New connections go to whichever WAN is preferred at that moment.&lt;/p&gt;
&lt;h2&gt;Exclude Certain Traffic from Load Balancing&lt;/h2&gt;
&lt;p&gt;Some traffic should always use a specific WAN:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# VPN traffic always uses primary (to maintain stable VPN connection)
set load-balancing wan rule 5 inbound-interface &apos;eth2&apos;
set load-balancing wan rule 5 destination port &apos;51820&apos;
set load-balancing wan rule 5 protocol &apos;udp&apos;
set load-balancing wan rule 5 interface eth0 weight &apos;100&apos;

# VoIP traffic uses secondary (more stable latency)
set load-balancing wan rule 6 inbound-interface &apos;eth2&apos;
set load-balancing wan rule 6 destination port &apos;5060-5061&apos;
set load-balancing wan rule 6 protocol &apos;udp&apos;
set load-balancing wan rule 6 interface eth1 weight &apos;100&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Rules are processed in order. Rule 5 and 6 handle specific traffic, rule 10 (from earlier) handles everything else.&lt;/p&gt;
&lt;h2&gt;Active-Active vs Active-Passive&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Active-Passive&lt;/strong&gt; (Failover):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set load-balancing wan rule 10 interface eth0 weight &apos;100&apos;
set load-balancing wan rule 10 interface eth1 weight &apos;1&apos;
set load-balancing wan rule 10 failover
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Primary handles all traffic. Secondary only used when primary fails.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Active-Active&lt;/strong&gt; (Load Sharing):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set load-balancing wan rule 10 interface eth0 weight &apos;70&apos;
set load-balancing wan rule 10 interface eth1 weight &apos;30&apos;
# Remove &apos;failover&apos; flag
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Traffic is distributed across both links: roughly 70% to primary, 30% to secondary.&lt;/p&gt;
&lt;p&gt;Active-Active provides more bandwidth but complicates troubleshooting and may cause issues with services that expect consistent source IP.&lt;/p&gt;
&lt;h2&gt;Testing Failover&lt;/h2&gt;
&lt;p&gt;Before relying on failover, test it:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Verify both WANs work independently&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Test via primary
ping -I eth0 8.8.8.8

# Test via secondary
ping -I eth1 8.8.8.8
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Simulate primary failure&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Temporarily block test target from primary using output filter
set firewall ipv4 name TEST rule 1 action &apos;drop&apos;
set firewall ipv4 name TEST rule 1 destination address &apos;8.8.8.8&apos;
set firewall ipv4 output filter rule 100 outbound-interface name &apos;eth0&apos;
set firewall ipv4 output filter rule 100 action &apos;jump&apos;
set firewall ipv4 output filter rule 100 jump-target &apos;TEST&apos;
commit

# Watch failover happen
show wan-load-balance

# Remove test firewall
delete firewall ipv4 name TEST
delete firewall ipv4 output filter rule 100
commit
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Physically disconnect primary&lt;/strong&gt;
Unplug eth0. Verify traffic continues via eth1. Reconnect and verify fail-back.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Complete Multi-WAN Configuration&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;# === Interfaces ===
set interfaces ethernet eth0 description &apos;WAN-PRIMARY&apos;
set interfaces ethernet eth0 address dhcp
set interfaces ethernet eth1 description &apos;WAN-SECONDARY&apos;
set interfaces ethernet eth1 address dhcp
set interfaces ethernet eth2 description &apos;LAN&apos;
set interfaces ethernet eth2 address &apos;10.0.0.1/24&apos;

# === NAT ===
set nat source rule 100 outbound-interface name &apos;eth0&apos;
set nat source rule 100 source address &apos;10.0.0.0/24&apos;
set nat source rule 100 translation address &apos;masquerade&apos;
set nat source rule 110 outbound-interface name &apos;eth1&apos;
set nat source rule 110 source address &apos;10.0.0.0/24&apos;
set nat source rule 110 translation address &apos;masquerade&apos;

# === WAN Load Balancing with Health Check ===
set load-balancing wan interface-health eth0 failure-count &apos;3&apos;
set load-balancing wan interface-health eth0 nexthop &apos;dhcp&apos;
set load-balancing wan interface-health eth0 success-count &apos;3&apos;
set load-balancing wan interface-health eth0 test 10 resp-time &apos;5&apos;
set load-balancing wan interface-health eth0 test 10 target &apos;8.8.8.8&apos;
set load-balancing wan interface-health eth0 test 10 type &apos;ping&apos;

set load-balancing wan interface-health eth1 failure-count &apos;3&apos;
set load-balancing wan interface-health eth1 nexthop &apos;dhcp&apos;
set load-balancing wan interface-health eth1 success-count &apos;3&apos;
set load-balancing wan interface-health eth1 test 10 resp-time &apos;5&apos;
set load-balancing wan interface-health eth1 test 10 target &apos;8.8.4.4&apos;
set load-balancing wan interface-health eth1 test 10 type &apos;ping&apos;

set load-balancing wan rule 10 inbound-interface &apos;eth2&apos;
set load-balancing wan rule 10 interface eth0 weight &apos;100&apos;
set load-balancing wan rule 10 interface eth1 weight &apos;1&apos;
set load-balancing wan rule 10 failover

set load-balancing wan sticky-connections inbound
set load-balancing wan enable-local-traffic
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;Multi-WAN without proper health checking is false confidence. Your router might report two routes while happily sending traffic into a black hole.&lt;/p&gt;
&lt;p&gt;Real failover requires:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Active health checks&lt;/strong&gt; that test actual connectivity, not just link state&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reasonable timers&lt;/strong&gt; - fast enough to detect failures quickly, slow enough to avoid flapping&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Testing&lt;/strong&gt; - verify failover actually works before you need it&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monitoring&lt;/strong&gt; - alerts when failover happens so you know to investigate&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;VyOS&apos;s WAN load balancing provides all of this out of the box. Configure it, test it, and trust it — but verify with monitoring.&lt;/p&gt;
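&lt;p&gt;For the monitoring piece, a small scheduled script can raise a syslog alert whenever a health check reports failure. This is a sketch — the script path is arbitrary and the exact &lt;code&gt;show wan-load-balance&lt;/code&gt; wording varies by VyOS version, so adjust the grep pattern to your output:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/vbash
# /config/scripts/wan-alert.sh (example path)
source /opt/vyatta/etc/functions/script-template

# Flag any interface the health checker currently considers failed
if run show wan-load-balance | grep -qi &apos;fail&apos;; then
    logger -p user.warning &quot;Multi-WAN: a WAN interface is failing health checks&quot;
fi
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Schedule it with &lt;code&gt;task-scheduler&lt;/code&gt; the same way as the health-check script, and have your log pipeline alert on the message.&lt;/p&gt;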
</content:encoded><category>vyos</category><category>ha</category><category>networking</category><category>routing</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>IPsec on VyOS: Site-to-Site Tunnels That Survive Reality</title><link>https://ashimov.com/posts/vyos-ipsec/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-ipsec/</guid><description>Configuring reliable IPsec site-to-site VPNs on VyOS. Covers IKEv2 setup, NAT traversal, dead peer detection, rekeying, and systematic debugging when things go wrong.</description><pubDate>Fri, 17 Oct 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;IPsec has a reputation for being complex and fragile. There&apos;s some truth to that — it has more moving parts than WireGuard, more states to manage, more things that can go wrong. But IPsec is also universal. It works with nearly any vendor&apos;s equipment and is often required for corporate connectivity.&lt;/p&gt;
&lt;p&gt;The key to reliable IPsec: understand the timers and states. When an IPsec tunnel fails, it&apos;s almost always timer mismatch or phase state issues. This guide covers practical IPsec configuration and the debugging skills to diagnose problems.&lt;/p&gt;
&lt;h2&gt;IPsec Fundamentals&lt;/h2&gt;
&lt;p&gt;IPsec has two phases:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;IKE Phase 1 (IKE SA)&lt;/strong&gt;: Negotiate encryption, authenticate peers, establish secure channel for Phase 2 negotiations.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;IKE Phase 2 (IPsec SA / Child SA)&lt;/strong&gt;: Negotiate the actual tunnel parameters, establish encryption for user traffic.&lt;/p&gt;
&lt;p&gt;Both phases have lifetimes. When they expire, rekeying occurs. Mismatched timers between peers cause tunnels to drop.&lt;/p&gt;
&lt;h2&gt;Site-to-Site Configuration: IKEv2&lt;/h2&gt;
&lt;p&gt;IKEv2 is preferred over IKEv1 — better NAT traversal, faster failover, simpler configuration. Use IKEv1 only when the peer doesn&apos;t support IKEv2.&lt;/p&gt;
&lt;h3&gt;Site A Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# IKE group (Phase 1 parameters)
set vpn ipsec ike-group IKE-SITE-B close-action &apos;none&apos;
set vpn ipsec ike-group IKE-SITE-B dead-peer-detection action &apos;restart&apos;
set vpn ipsec ike-group IKE-SITE-B dead-peer-detection interval &apos;30&apos;
set vpn ipsec ike-group IKE-SITE-B dead-peer-detection timeout &apos;120&apos;
set vpn ipsec ike-group IKE-SITE-B key-exchange &apos;ikev2&apos;
set vpn ipsec ike-group IKE-SITE-B lifetime &apos;28800&apos;
set vpn ipsec ike-group IKE-SITE-B proposal 1 dh-group &apos;14&apos;
set vpn ipsec ike-group IKE-SITE-B proposal 1 encryption &apos;aes256&apos;
set vpn ipsec ike-group IKE-SITE-B proposal 1 hash &apos;sha256&apos;

# ESP group (Phase 2 parameters)
set vpn ipsec esp-group ESP-SITE-B lifetime &apos;3600&apos;
set vpn ipsec esp-group ESP-SITE-B pfs &apos;dh-group14&apos;
set vpn ipsec esp-group ESP-SITE-B proposal 1 encryption &apos;aes256&apos;
set vpn ipsec esp-group ESP-SITE-B proposal 1 hash &apos;sha256&apos;

# Interface binding
set vpn ipsec interface &apos;eth0&apos;

# Site-to-Site connection
set vpn ipsec site-to-site peer SITE-B authentication local-id &apos;site-a@example.com&apos;
set vpn ipsec site-to-site peer SITE-B authentication mode &apos;pre-shared-secret&apos;
set vpn ipsec site-to-site peer SITE-B authentication pre-shared-secret &apos;YourVeryStrongPSKHere123!&apos;
set vpn ipsec site-to-site peer SITE-B authentication remote-id &apos;site-b@example.com&apos;
set vpn ipsec site-to-site peer SITE-B connection-type &apos;initiate&apos;
set vpn ipsec site-to-site peer SITE-B default-esp-group &apos;ESP-SITE-B&apos;
set vpn ipsec site-to-site peer SITE-B ike-group &apos;IKE-SITE-B&apos;
set vpn ipsec site-to-site peer SITE-B local-address &apos;203.0.113.1&apos;
set vpn ipsec site-to-site peer SITE-B remote-address &apos;198.51.100.1&apos;

# Traffic selectors (what traffic goes through the tunnel)
set vpn ipsec site-to-site peer SITE-B tunnel 1 local prefix &apos;10.1.0.0/24&apos;
set vpn ipsec site-to-site peer SITE-B tunnel 1 remote prefix &apos;10.2.0.0/24&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Site B Configuration&lt;/h3&gt;
&lt;p&gt;Mirror configuration with swapped addresses and IDs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# IKE group - MUST MATCH Site A
set vpn ipsec ike-group IKE-SITE-A close-action &apos;none&apos;
set vpn ipsec ike-group IKE-SITE-A dead-peer-detection action &apos;restart&apos;
set vpn ipsec ike-group IKE-SITE-A dead-peer-detection interval &apos;30&apos;
set vpn ipsec ike-group IKE-SITE-A dead-peer-detection timeout &apos;120&apos;
set vpn ipsec ike-group IKE-SITE-A key-exchange &apos;ikev2&apos;
set vpn ipsec ike-group IKE-SITE-A lifetime &apos;28800&apos;
set vpn ipsec ike-group IKE-SITE-A proposal 1 dh-group &apos;14&apos;
set vpn ipsec ike-group IKE-SITE-A proposal 1 encryption &apos;aes256&apos;
set vpn ipsec ike-group IKE-SITE-A proposal 1 hash &apos;sha256&apos;

# ESP group - MUST MATCH Site A
set vpn ipsec esp-group ESP-SITE-A lifetime &apos;3600&apos;
set vpn ipsec esp-group ESP-SITE-A pfs &apos;dh-group14&apos;
set vpn ipsec esp-group ESP-SITE-A proposal 1 encryption &apos;aes256&apos;
set vpn ipsec esp-group ESP-SITE-A proposal 1 hash &apos;sha256&apos;

set vpn ipsec interface &apos;eth0&apos;

set vpn ipsec site-to-site peer SITE-A authentication local-id &apos;site-b@example.com&apos;
set vpn ipsec site-to-site peer SITE-A authentication mode &apos;pre-shared-secret&apos;
set vpn ipsec site-to-site peer SITE-A authentication pre-shared-secret &apos;YourVeryStrongPSKHere123!&apos;
set vpn ipsec site-to-site peer SITE-A authentication remote-id &apos;site-a@example.com&apos;
set vpn ipsec site-to-site peer SITE-A connection-type &apos;initiate&apos;
set vpn ipsec site-to-site peer SITE-A default-esp-group &apos;ESP-SITE-A&apos;
set vpn ipsec site-to-site peer SITE-A ike-group &apos;IKE-SITE-A&apos;
set vpn ipsec site-to-site peer SITE-A local-address &apos;198.51.100.1&apos;
set vpn ipsec site-to-site peer SITE-A remote-address &apos;203.0.113.1&apos;

set vpn ipsec site-to-site peer SITE-A tunnel 1 local prefix &apos;10.2.0.0/24&apos;
set vpn ipsec site-to-site peer SITE-A tunnel 1 remote prefix &apos;10.1.0.0/24&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Critical: Parameter Matching&lt;/h2&gt;
&lt;p&gt;Both peers MUST have identical:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Key exchange version (ikev2)&lt;/li&gt;
&lt;li&gt;IKE lifetime&lt;/li&gt;
&lt;li&gt;ESP lifetime&lt;/li&gt;
&lt;li&gt;DH group&lt;/li&gt;
&lt;li&gt;Encryption algorithm&lt;/li&gt;
&lt;li&gt;Hash algorithm&lt;/li&gt;
&lt;li&gt;PFS settings&lt;/li&gt;
&lt;li&gt;Traffic selectors (swapped local/remote)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Mismatch in any of these = tunnel won&apos;t establish or will randomly fail.&lt;/p&gt;
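&lt;p&gt;When in doubt about what each side is actually offering, the strongSwan backend that VyOS uses can print the loaded connection with its proposals, ready to compare line-by-line against the peer:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sudo swanctl --list-conns
&lt;/code&gt;&lt;/pre&gt;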
&lt;h2&gt;NAT Traversal (NAT-T)&lt;/h2&gt;
&lt;p&gt;When either peer is behind NAT, IPsec encapsulates packets in UDP 4500 instead of raw ESP (protocol 50). VyOS enables NAT-T automatically, but you may need firewall rules:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set firewall ipv4 name WAN-LOCAL rule 70 action &apos;accept&apos;
set firewall ipv4 name WAN-LOCAL rule 70 protocol &apos;udp&apos;
set firewall ipv4 name WAN-LOCAL rule 70 destination port &apos;500&apos;

set firewall ipv4 name WAN-LOCAL rule 71 action &apos;accept&apos;
set firewall ipv4 name WAN-LOCAL rule 71 protocol &apos;udp&apos;
set firewall ipv4 name WAN-LOCAL rule 71 destination port &apos;4500&apos;

# Also allow ESP protocol for non-NAT scenarios
set firewall ipv4 name WAN-LOCAL rule 72 action &apos;accept&apos;
set firewall ipv4 name WAN-LOCAL rule 72 protocol &apos;esp&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Dead Peer Detection (DPD)&lt;/h2&gt;
&lt;p&gt;DPD detects when the remote peer becomes unreachable. Without it, your router won&apos;t know the tunnel is dead until traffic fails.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set vpn ipsec ike-group IKE-SITE-B dead-peer-detection action &apos;restart&apos;
set vpn ipsec ike-group IKE-SITE-B dead-peer-detection interval &apos;30&apos;
set vpn ipsec ike-group IKE-SITE-B dead-peer-detection timeout &apos;120&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;interval&lt;/strong&gt;: Send DPD request every 30 seconds&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;timeout&lt;/strong&gt;: Declare peer dead after 120 seconds without response&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;action&lt;/strong&gt;: restart = try to re-establish, clear = delete SA, none = do nothing&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For stable connections, &lt;code&gt;restart&lt;/code&gt; is usually best. It automatically recovers from transient outages.&lt;/p&gt;
&lt;h2&gt;Rekeying: The Lifetime Dance&lt;/h2&gt;
&lt;p&gt;IKE and ESP SAs have separate lifetimes. When they expire, rekeying occurs. Problems happen when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Both peers try to rekey simultaneously&lt;/li&gt;
&lt;li&gt;Timers differ slightly, causing race conditions&lt;/li&gt;
&lt;li&gt;Rekey fails and tunnel drops&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Best practices:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Make lifetimes identical on both peers&lt;/li&gt;
&lt;li&gt;IKE lifetime should be longer than ESP lifetime&lt;/li&gt;
&lt;li&gt;Common values: IKE 28800s (8h), ESP 3600s (1h)&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# Site A and Site B MUST match
set vpn ipsec ike-group IKE-SITE-B lifetime &apos;28800&apos;
set vpn ipsec esp-group ESP-SITE-B lifetime &apos;3600&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If rekeying causes drops, increase lifetimes. If security policy requires short lifetimes, ensure DPD is configured to recover quickly.&lt;/p&gt;
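&lt;p&gt;To watch rekeying as it happens rather than inferring it from drops, follow the strongSwan logs and inspect the remaining SA lifetimes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Live view of rekey negotiations
sudo journalctl -u strongswan -f | grep -i rekey

# Installed SAs with remaining lifetimes
sudo swanctl --list-sas
&lt;/code&gt;&lt;/pre&gt;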
&lt;h2&gt;Debugging IPsec&lt;/h2&gt;
&lt;p&gt;When IPsec fails, check in order:&lt;/p&gt;
&lt;h3&gt;1. Check SA Status&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;show vpn ipsec sa
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Shows established Security Associations. You want to see both IKE SA and Child SA (ESP).&lt;/p&gt;
&lt;h3&gt;2. Check Connection State&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;show vpn ipsec connections
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Shows connection status: ESTABLISHED, CONNECTING, INSTALLED, etc.&lt;/p&gt;
&lt;h3&gt;3. Check Logs&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;show log | match ipsec
# or
sudo journalctl -u strongswan -f
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Common errors:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;NO_PROPOSAL_CHOSEN&lt;/code&gt;: Algorithm mismatch&lt;/li&gt;
&lt;li&gt;&lt;code&gt;AUTHENTICATION_FAILED&lt;/code&gt;: Wrong PSK or ID mismatch&lt;/li&gt;
&lt;li&gt;&lt;code&gt;TS_UNACCEPTABLE&lt;/code&gt;: Traffic selector mismatch&lt;/li&gt;
&lt;li&gt;&lt;code&gt;INVALID_IKE_SPI&lt;/code&gt;: Stale SA, restart connection&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;4. Reset the Connection&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;reset vpn ipsec site-to-site peer SITE-B
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Clears SAs and re-initiates. Often fixes &quot;stuck&quot; tunnels.&lt;/p&gt;
&lt;h3&gt;5. Verify Traffic Selectors&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;show vpn ipsec sa detail
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Shows exactly what traffic selectors are installed. Mismatch here = traffic bypasses tunnel.&lt;/p&gt;
&lt;h3&gt;Common Issues&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;No SA established&lt;/td&gt;
&lt;td&gt;Firewall blocking 500/4500/ESP&lt;/td&gt;
&lt;td&gt;Open firewall ports&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auth failed&lt;/td&gt;
&lt;td&gt;PSK mismatch or wrong local/remote-id&lt;/td&gt;
&lt;td&gt;Verify both match exactly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Connects then drops&lt;/td&gt;
&lt;td&gt;Timer mismatch or rekey failure&lt;/td&gt;
&lt;td&gt;Match lifetimes, check DPD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Traffic doesn&apos;t flow&lt;/td&gt;
&lt;td&gt;Traffic selector mismatch&lt;/td&gt;
&lt;td&gt;Verify local/remote prefix match&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Works then stops&lt;/td&gt;
&lt;td&gt;NAT timeout (if behind NAT)&lt;/td&gt;
&lt;td&gt;Ensure NAT-T is working&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Debug Commands Summary&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;show vpn ipsec sa              # SA status
show vpn ipsec connections     # Connection state
show vpn ipsec sa detail       # Detailed SA info including traffic selectors
show vpn ipsec status          # Overall IPsec status
reset vpn ipsec site-to-site peer &amp;lt;name&amp;gt;  # Reset specific peer
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Route-Based vs Policy-Based IPsec&lt;/h2&gt;
&lt;p&gt;VyOS supports both:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Policy-based&lt;/strong&gt; (shown above): Traffic selectors define what goes through tunnel. Configured via &lt;code&gt;tunnel X local/remote prefix&lt;/code&gt;. Simpler, but less flexible.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Route-based&lt;/strong&gt;: Virtual tunnel interface (vti), routes determine what traffic enters. More flexible, better for dynamic routing.&lt;/p&gt;
&lt;h3&gt;Route-Based Example&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Virtual tunnel interface
set interfaces vti vti0 address &apos;10.255.0.1/30&apos;
set interfaces vti vti0 description &apos;IPsec to Site B&apos;

# IPsec connection using vti
set vpn ipsec site-to-site peer SITE-B tunnel 1 local prefix &apos;0.0.0.0/0&apos;
set vpn ipsec site-to-site peer SITE-B tunnel 1 remote prefix &apos;0.0.0.0/0&apos;
set vpn ipsec site-to-site peer SITE-B vti bind &apos;vti0&apos;

# Route traffic to remote network through vti
set protocols static route 10.2.0.0/24 interface vti0

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Route-based is better when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You need dynamic routing (OSPF/BGP over IPsec)&lt;/li&gt;
&lt;li&gt;Multiple networks with complex routing&lt;/li&gt;
&lt;li&gt;You want firewall rules on the tunnel interface&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Firewall for IPsec Traffic&lt;/h2&gt;
&lt;p&gt;Traffic through IPsec still needs firewall consideration:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# If using VTI, apply firewall via forward filter
# Define what the remote site can access
set firewall ipv4 name IPSEC-IN default-action &apos;drop&apos;
set firewall ipv4 name IPSEC-IN rule 10 action &apos;accept&apos;
set firewall ipv4 name IPSEC-IN rule 10 state &apos;established&apos;
set firewall ipv4 name IPSEC-IN rule 10 state &apos;related&apos;
set firewall ipv4 name IPSEC-IN rule 20 action &apos;accept&apos;
set firewall ipv4 name IPSEC-IN rule 20 destination address &apos;10.1.0.0/24&apos;

# Apply to forward filter
set firewall ipv4 forward filter rule 30 inbound-interface name &apos;vti0&apos;
set firewall ipv4 forward filter rule 30 action &apos;jump&apos;
set firewall ipv4 forward filter rule 30 jump-target &apos;IPSEC-IN&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Production Checklist&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;[ ] IKE and ESP parameters match on both peers&lt;/li&gt;
&lt;li&gt;[ ] Lifetimes match exactly&lt;/li&gt;
&lt;li&gt;[ ] DPD configured with appropriate action&lt;/li&gt;
&lt;li&gt;[ ] Firewall allows UDP 500, 4500, and ESP&lt;/li&gt;
&lt;li&gt;[ ] Traffic selectors match (local/remote swapped)&lt;/li&gt;
&lt;li&gt;[ ] PSK is strong (20+ random characters)&lt;/li&gt;
&lt;li&gt;[ ] Local and remote IDs match configuration&lt;/li&gt;
&lt;li&gt;[ ] Monitoring/alerting for tunnel status&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;IPsec reliability comes down to discipline:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Timer discipline&lt;/strong&gt;: Both peers must have identical lifetimes. IKE &amp;gt; ESP lifetime. Configure DPD to detect and recover from failures.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;SA verification&lt;/strong&gt;: Regularly check &lt;code&gt;show vpn ipsec sa&lt;/code&gt;. If IKE SA exists but Child SA doesn&apos;t, you have a Phase 2 problem. If neither exists, Phase 1 isn&apos;t completing.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Methodical debugging&lt;/strong&gt;: Check SA status → check logs for specific error → verify matching configuration → reset and try again.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
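&lt;p&gt;Those three steps map to a handful of operational commands (names vary slightly between VyOS releases, so treat this as a sketch):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# 1. Where does the state machine stop?
show vpn ipsec sa

# 2. Find the specific negotiation error
show log vpn

# 3. After fixing the mismatch, re-initiate
restart vpn
&lt;/code&gt;&lt;/pre&gt;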
&lt;p&gt;IPsec is more complex than WireGuard, but it&apos;s deterministic. When you understand the state machine (IKE SA → Child SA → traffic flows), you can diagnose any issue by figuring out where in that sequence things break.&lt;/p&gt;
</content:encoded><category>vyos</category><category>networking</category><category>security</category><category>vpn</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>WireGuard on VyOS: Production Configuration for Site-to-Site and Road Warriors</title><link>https://ashimov.com/posts/vyos-wireguard/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-wireguard/</guid><description>Complete WireGuard setup on VyOS covering site-to-site tunnels, mobile clients, kill switches, split vs full tunnel, and the two things that make WireGuard stable: MTU and routing policy.</description><pubDate>Tue, 14 Oct 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;WireGuard is simple by design — a few keys, some IP addresses, and you&apos;re connected. But &quot;connected&quot; and &quot;production-ready&quot; are different things. Intermittent disconnections, mysterious packet loss, traffic leaking outside the tunnel — these happen when you skip the fundamentals.&lt;/p&gt;
&lt;p&gt;The two things that make WireGuard stable: &lt;strong&gt;correct MTU&lt;/strong&gt; and &lt;strong&gt;clear routing policy&lt;/strong&gt;. Get these right, and WireGuard becomes boring (in the best way).&lt;/p&gt;
&lt;h2&gt;WireGuard Basics on VyOS&lt;/h2&gt;
&lt;p&gt;VyOS treats WireGuard as a first-class interface. Configuration is straightforward:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Generate keypair (or use existing)
run generate pki wireguard key-pair

# Save the output:
# Private-key: &amp;lt;base64 private key&amp;gt;
# Public-key: &amp;lt;base64 public key&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Store the private key securely. The public key is what you share with peers.&lt;/p&gt;
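&lt;p&gt;For peers that aren&apos;t VyOS boxes, the standard &lt;code&gt;wg&lt;/code&gt; tool generates the same kind of keypair (sketch for a Linux client; the file names are arbitrary):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Restrict permissions before writing key material
umask 077

# Generate a private key and derive its public key in one pass
wg genkey | tee client-private.key | wg pubkey &amp;gt; client-public.key
&lt;/code&gt;&lt;/pre&gt;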
&lt;h2&gt;Site-to-Site Configuration&lt;/h2&gt;
&lt;p&gt;Two VyOS routers connecting their networks. Site A (10.1.0.0/24) connects to Site B (10.2.0.0/24).&lt;/p&gt;
&lt;h3&gt;Site A Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# WireGuard interface
set interfaces wireguard wg0 address &apos;10.255.255.1/30&apos;
set interfaces wireguard wg0 description &apos;Site-to-Site to Site B&apos;
set interfaces wireguard wg0 port &apos;51820&apos;
set interfaces wireguard wg0 private-key &apos;&amp;lt;site-a-private-key&amp;gt;&apos;

# Peer configuration (Site B)
set interfaces wireguard wg0 peer site-b public-key &apos;&amp;lt;site-b-public-key&amp;gt;&apos;
set interfaces wireguard wg0 peer site-b allowed-ips &apos;10.255.255.2/32&apos;
set interfaces wireguard wg0 peer site-b allowed-ips &apos;10.2.0.0/24&apos;
set interfaces wireguard wg0 peer site-b endpoint &apos;site-b.example.com:51820&apos;
set interfaces wireguard wg0 peer site-b persistent-keepalive &apos;25&apos;

# Route to Site B&apos;s network
set protocols static route 10.2.0.0/24 interface wg0

# Firewall: allow WireGuard traffic
set firewall ipv4 name WAN-LOCAL rule 60 action &apos;accept&apos;
set firewall ipv4 name WAN-LOCAL rule 60 protocol &apos;udp&apos;
set firewall ipv4 name WAN-LOCAL rule 60 destination port &apos;51820&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Site B Configuration&lt;/h3&gt;
&lt;p&gt;Mirror configuration with swapped keys and addresses:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

set interfaces wireguard wg0 address &apos;10.255.255.2/30&apos;
set interfaces wireguard wg0 description &apos;Site-to-Site to Site A&apos;
set interfaces wireguard wg0 port &apos;51820&apos;
set interfaces wireguard wg0 private-key &apos;&amp;lt;site-b-private-key&amp;gt;&apos;

set interfaces wireguard wg0 peer site-a public-key &apos;&amp;lt;site-a-public-key&amp;gt;&apos;
set interfaces wireguard wg0 peer site-a allowed-ips &apos;10.255.255.1/32&apos;
set interfaces wireguard wg0 peer site-a allowed-ips &apos;10.1.0.0/24&apos;
set interfaces wireguard wg0 peer site-a endpoint &apos;site-a.example.com:51820&apos;
set interfaces wireguard wg0 peer site-a persistent-keepalive &apos;25&apos;

set protocols static route 10.1.0.0/24 interface wg0

set firewall ipv4 name WAN-LOCAL rule 60 action &apos;accept&apos;
set firewall ipv4 name WAN-LOCAL rule 60 protocol &apos;udp&apos;
set firewall ipv4 name WAN-LOCAL rule 60 destination port &apos;51820&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Validation&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check interface status
show interfaces wireguard wg0

# Check peer status
show wireguard peers

# Test connectivity
ping 10.255.255.2    # Tunnel endpoint
ping 10.2.0.1        # Remote LAN (from Site A)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Road Warrior Configuration (Mobile Clients)&lt;/h2&gt;
&lt;p&gt;Roaming clients that connect from anywhere. The VyOS router acts as the VPN server.&lt;/p&gt;
&lt;h3&gt;VyOS Server Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

set interfaces wireguard wg0 address &apos;10.10.0.1/24&apos;
set interfaces wireguard wg0 description &apos;Road Warrior VPN&apos;
set interfaces wireguard wg0 port &apos;51820&apos;
set interfaces wireguard wg0 private-key &apos;&amp;lt;server-private-key&amp;gt;&apos;

# Client 1 (laptop)
set interfaces wireguard wg0 peer laptop public-key &apos;&amp;lt;laptop-public-key&amp;gt;&apos;
set interfaces wireguard wg0 peer laptop allowed-ips &apos;10.10.0.10/32&apos;

# Client 2 (phone)
set interfaces wireguard wg0 peer phone public-key &apos;&amp;lt;phone-public-key&amp;gt;&apos;
set interfaces wireguard wg0 peer phone allowed-ips &apos;10.10.0.11/32&apos;

# Allow VPN clients to access LAN and internet
set firewall group network-group VPN-CLIENTS network &apos;10.10.0.0/24&apos;

# NAT for VPN clients going to internet
set nat source rule 200 outbound-interface name &apos;eth0&apos;
set nat source rule 200 source address &apos;10.10.0.0/24&apos;
set nat source rule 200 translation address &apos;masquerade&apos;

# Firewall: allow WireGuard
set firewall ipv4 name WAN-LOCAL rule 60 action &apos;accept&apos;
set firewall ipv4 name WAN-LOCAL rule 60 protocol &apos;udp&apos;
set firewall ipv4 name WAN-LOCAL rule 60 destination port &apos;51820&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Client Configuration (wg0.conf)&lt;/h3&gt;
&lt;p&gt;For laptop/phone using standard WireGuard client:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[Interface]
PrivateKey = &amp;lt;laptop-private-key&amp;gt;
Address = 10.10.0.10/32
DNS = 10.0.0.1

[Peer]
PublicKey = &amp;lt;server-public-key&amp;gt;
AllowedIPs = 0.0.0.0/0
Endpoint = vpn.example.com:51820
PersistentKeepalive = 25
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;AllowedIPs = 0.0.0.0/0&lt;/code&gt; means full tunnel — all traffic through VPN.&lt;/p&gt;
&lt;h2&gt;Split Tunnel vs Full Tunnel&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Full tunnel&lt;/strong&gt;: All client traffic goes through VPN&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;AllowedIPs = 0.0.0.0/0, ::/0&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Pros: All traffic protected, consistent IP&lt;/li&gt;
&lt;li&gt;Cons: Higher latency, more bandwidth on VPN server&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Split tunnel&lt;/strong&gt;: Only specific traffic through VPN&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;AllowedIPs = 10.0.0.0/8, 192.168.0.0/16&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Pros: Better performance, less server load&lt;/li&gt;
&lt;li&gt;Cons: Some traffic exposed, DNS leaks possible&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Split Tunnel Client Example&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;[Interface]
PrivateKey = &amp;lt;private-key&amp;gt;
Address = 10.10.0.10/32

[Peer]
PublicKey = &amp;lt;server-public-key&amp;gt;
AllowedIPs = 10.0.0.0/8, 10.10.0.0/24
Endpoint = vpn.example.com:51820
PersistentKeepalive = 25
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Only traffic to 10.x.x.x goes through VPN. Everything else uses local internet.&lt;/p&gt;
&lt;h2&gt;The MTU Problem&lt;/h2&gt;
&lt;p&gt;WireGuard encapsulates packets, adding overhead. If your MTU is too high, packets get fragmented or dropped. Symptoms:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;SSH works, HTTPS fails&lt;/li&gt;
&lt;li&gt;Small requests work, large transfers hang&lt;/li&gt;
&lt;li&gt;Intermittent &quot;connection reset&quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Calculate Correct MTU&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Standard Ethernet MTU: 1500&lt;/li&gt;
&lt;li&gt;WireGuard overhead: 60 bytes (IPv4 outer) or 80 bytes (IPv6 outer)&lt;/li&gt;
&lt;li&gt;Safe WireGuard MTU: &lt;strong&gt;1420&lt;/strong&gt; (IPv4) or &lt;strong&gt;1400&lt;/strong&gt; (IPv6)&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;configure

set interfaces wireguard wg0 mtu &apos;1420&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Test MTU&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# From client, test path MTU to a host through the tunnel
ping -M do -s 1392 10.0.0.1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;-M do&lt;/code&gt; prevents fragmentation. &lt;code&gt;-s 1392&lt;/code&gt; = 1392 payload + 28 header = 1420. If it works, MTU is correct. If not, lower it.&lt;/p&gt;
&lt;p&gt;For connections through multiple NATs or tunnels, you might need 1380 or even lower.&lt;/p&gt;
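&lt;p&gt;If you can&apos;t control MTU on every client, MSS clamping on the router caps TCP segment size regardless of what clients believe the path looks like. On VyOS 1.4 this is an interface option (a sketch; verify the node exists on your release, and use tunnel MTU minus 40 for IPv4 TCP):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Clamp TCP MSS on the tunnel interface (1420 - 40 = 1380)
set interfaces wireguard wg0 ip adjust-mss &apos;1380&apos;

commit
&lt;/code&gt;&lt;/pre&gt;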
&lt;h2&gt;Kill Switch: Preventing Leaks&lt;/h2&gt;
&lt;p&gt;A kill switch ensures traffic can&apos;t leak if the VPN disconnects. On VyOS server, you control routing. On clients, configure the client app or OS firewall.&lt;/p&gt;
&lt;h3&gt;Server-Side: Ensure Clients Use VPN&lt;/h3&gt;
&lt;p&gt;If clients should only access internet through VPN:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Already covered by NAT rule - VPN clients are masqueraded
# No direct route from VPN subnet to internet except through NAT
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Client-Side Kill Switch (Linux)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Allow only WireGuard and local traffic
iptables -A OUTPUT -o wg0 -j ACCEPT
iptables -A OUTPUT -o lo -j ACCEPT
iptables -A OUTPUT -p udp --dport 51820 -j ACCEPT
iptables -A OUTPUT -j DROP
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or use WireGuard&apos;s PostUp/PostDown scripts in the config.&lt;/p&gt;
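&lt;p&gt;For wg-quick clients, the wg-quick man page ships a kill switch built on exactly this idea: PostUp installs a reject rule for anything not destined to the tunnel or a local address, PreDown removes it. It relies on the fwmark wg-quick sets for full-tunnel (0.0.0.0/0) configs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[Interface]
PrivateKey = &amp;lt;private-key&amp;gt;
Address = 10.10.0.10/32
PostUp = iptables -I OUTPUT ! -o %i -m mark ! --mark $(wg show %i fwmark) -m addrtype ! --dst-type LOCAL -j REJECT
PreDown = iptables -D OUTPUT ! -o %i -m mark ! --mark $(wg show %i fwmark) -m addrtype ! --dst-type LOCAL -j REJECT
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;%i&lt;/code&gt; expands to the interface name at runtime.&lt;/p&gt;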
&lt;h2&gt;Persistent Keepalive: When to Use It&lt;/h2&gt;
&lt;p&gt;WireGuard is silent when idle — if there&apos;s nothing to send, nothing goes on the wire. This causes problems:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;NAT mappings expire (typically 30-60 seconds)&lt;/li&gt;
&lt;li&gt;Stateful firewalls drop the &quot;connection&quot;&lt;/li&gt;
&lt;li&gt;You can&apos;t initiate connections TO the client&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;code&gt;persistent-keepalive &apos;25&apos;&lt;/code&gt; sends a keepalive every 25 seconds, keeping NAT/firewall state alive.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use it when&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Client is behind NAT&lt;/li&gt;
&lt;li&gt;Either side has stateful firewall&lt;/li&gt;
&lt;li&gt;You need to reach the client from the server&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Skip it when&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Both sides have static public IPs&lt;/li&gt;
&lt;li&gt;No NAT involved&lt;/li&gt;
&lt;li&gt;Saving minimal bandwidth matters&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Debugging WireGuard&lt;/h2&gt;
&lt;h3&gt;Check Interface Status&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;show interfaces wireguard wg0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Should show UP state and assigned address.&lt;/p&gt;
&lt;h3&gt;Check Peer Handshakes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;show wireguard peers
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Shows last handshake time. If &quot;never&quot; or very old, tunnel isn&apos;t working.&lt;/p&gt;
&lt;h3&gt;Check Keys Match&lt;/h3&gt;
&lt;p&gt;Most common issue: public/private key mismatch. Verify:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Server has client&apos;s public key&lt;/li&gt;
&lt;li&gt;Client has server&apos;s public key&lt;/li&gt;
&lt;li&gt;No copy-paste errors (check for trailing spaces)&lt;/li&gt;
&lt;/ul&gt;
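&lt;p&gt;Rather than eyeballing base64 strings, prove the pairing: deriving the public key from the private key in use must reproduce exactly what the other side has configured (sketch; &lt;code&gt;client-private.key&lt;/code&gt; is a hypothetical file holding the client&apos;s private key):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On the client: derive the public key from the private key actually in use
wg pubkey &amp;lt; client-private.key

# On VyOS: what the server thinks the client&apos;s public key is
show configuration commands | match &apos;peer laptop public-key&apos;
&lt;/code&gt;&lt;/pre&gt;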
&lt;h3&gt;Check Firewall&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;show firewall ipv4 name WAN-LOCAL
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Ensure UDP 51820 is allowed inbound.&lt;/p&gt;
&lt;h3&gt;Check Routing&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;show ip route
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Routes through wg0 should exist for peer&apos;s allowed-ips.&lt;/p&gt;
&lt;h3&gt;Monitor Traffic&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;sudo tcpdump -i wg0 -n
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Should see traffic when peers communicate.&lt;/p&gt;
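&lt;p&gt;Also watch the encrypted side. If cleartext enters wg0 but no UDP 51820 leaves the WAN, the problem is local routing or NAT; if encrypted packets leave but nothing comes back, suspect the path or the far end:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sudo tcpdump -i eth0 -n udp port 51820
&lt;/code&gt;&lt;/pre&gt;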
&lt;h3&gt;Common Issues&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;No handshake&lt;/td&gt;
&lt;td&gt;Key mismatch or blocked port&lt;/td&gt;
&lt;td&gt;Verify keys, check firewall&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Handshake but no traffic&lt;/td&gt;
&lt;td&gt;Routing or allowed-ips wrong&lt;/td&gt;
&lt;td&gt;Check routes match allowed-ips&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Works then dies&lt;/td&gt;
&lt;td&gt;NAT timeout&lt;/td&gt;
&lt;td&gt;Enable persistent-keepalive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Large transfers fail&lt;/td&gt;
&lt;td&gt;MTU too high&lt;/td&gt;
&lt;td&gt;Lower MTU to 1420 or less&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;One direction works&lt;/td&gt;
&lt;td&gt;Asymmetric allowed-ips&lt;/td&gt;
&lt;td&gt;Both sides need matching allowed-ips&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;Production Checklist&lt;/h2&gt;
&lt;p&gt;Before calling it production-ready:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;[ ] MTU set correctly (1420 or tested value)&lt;/li&gt;
&lt;li&gt;[ ] Persistent keepalive enabled if behind NAT&lt;/li&gt;
&lt;li&gt;[ ] Firewall allows WireGuard port (UDP 51820)&lt;/li&gt;
&lt;li&gt;[ ] Routes exist for all allowed-ips&lt;/li&gt;
&lt;li&gt;[ ] Keys are backed up securely&lt;/li&gt;
&lt;li&gt;[ ] NAT configured if clients need internet access&lt;/li&gt;
&lt;li&gt;[ ] DNS configured for full-tunnel clients&lt;/li&gt;
&lt;li&gt;[ ] Kill switch configured if leak prevention needed&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Complete Road Warrior Example&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;# === WireGuard Interface ===
set interfaces wireguard wg0 address &apos;10.10.0.1/24&apos;
set interfaces wireguard wg0 mtu &apos;1420&apos;
set interfaces wireguard wg0 port &apos;51820&apos;
set interfaces wireguard wg0 private-key &apos;&amp;lt;server-private-key&amp;gt;&apos;

# === Peers ===
set interfaces wireguard wg0 peer laptop public-key &apos;&amp;lt;key&amp;gt;&apos;
set interfaces wireguard wg0 peer laptop allowed-ips &apos;10.10.0.10/32&apos;

set interfaces wireguard wg0 peer phone public-key &apos;&amp;lt;key&amp;gt;&apos;
set interfaces wireguard wg0 peer phone allowed-ips &apos;10.10.0.11/32&apos;

# === NAT for VPN clients ===
set nat source rule 200 outbound-interface name &apos;eth0&apos;
set nat source rule 200 source address &apos;10.10.0.0/24&apos;
set nat source rule 200 translation address &apos;masquerade&apos;

# === Firewall ===
set firewall ipv4 name WAN-LOCAL rule 60 action &apos;accept&apos;
set firewall ipv4 name WAN-LOCAL rule 60 protocol &apos;udp&apos;
set firewall ipv4 name WAN-LOCAL rule 60 destination port &apos;51820&apos;

# VPN clients to LAN (apply via forward filter)
set firewall ipv4 name VPN-TO-LAN default-action &apos;accept&apos;
set firewall ipv4 forward filter rule 50 inbound-interface name &apos;wg0&apos;
set firewall ipv4 forward filter rule 50 action &apos;jump&apos;
set firewall ipv4 forward filter rule 50 jump-target &apos;VPN-TO-LAN&apos;

# === DNS for VPN clients ===
set service dns forwarding listen-address &apos;10.10.0.1&apos;
set service dns forwarding allow-from &apos;10.10.0.0/24&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;WireGuard becomes stable after addressing two things:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;MTU&lt;/strong&gt;: Set it explicitly to 1420 or lower. Don&apos;t rely on automatic MTU discovery — it often fails through NAT and firewalls.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Routing policy&lt;/strong&gt;: Be explicit about what traffic goes where. &lt;code&gt;allowed-ips&lt;/code&gt; controls both routing AND cryptographic acceptance. If it&apos;s not in allowed-ips, it won&apos;t be accepted even if routed correctly.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Everything else — keepalives, firewall rules, NAT — follows logically from the use case. But MTU and routing policy are where most WireGuard problems live. Fix those, and the rest falls into place.&lt;/p&gt;
</content:encoded><category>vyos</category><category>networking</category><category>security</category><category>vpn</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Policy-Based Routing on VyOS: Practical Patterns for Split Routing</title><link>https://ashimov.com/posts/vyos-pbr/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-pbr/</guid><description>How to route specific traffic through different gateways on VyOS. Covers routing by source, destination, domain, and application with real-world examples like split-tunnel VPN.</description><pubDate>Fri, 10 Oct 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Standard routing is simple: packets go to the destination via the best route in the table. But what if you need specific traffic to take a different path? Work traffic through the VPN, streaming through the ISP, certain devices always through a specific gateway?&lt;/p&gt;
&lt;p&gt;That&apos;s Policy-Based Routing (PBR). VyOS implements PBR through policy routes and routing tables. It sounds complex, but the pattern is simple: match the traffic, then route it to a specific table.&lt;/p&gt;
&lt;h2&gt;The PBR Mental Model&lt;/h2&gt;
&lt;p&gt;Two components work together:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Routing table&lt;/strong&gt;: Alternative routes for specific traffic&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Policy route&lt;/strong&gt;: Match traffic and direct to the appropriate table&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;Packet arrives → Policy route matches → Routed via alternate table
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;VyOS 1.4 &lt;code&gt;policy route&lt;/code&gt; can match traffic directly (by source, destination, protocol, etc.) and route it to a specific table — no firewall marks needed for most cases.&lt;/p&gt;
&lt;h2&gt;Scenario 1: Route Specific Subnet Through VPN&lt;/h2&gt;
&lt;p&gt;Let&apos;s say you have:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Default internet via eth0 (ISP)&lt;/li&gt;
&lt;li&gt;WireGuard VPN on wg0&lt;/li&gt;
&lt;li&gt;Want 10.0.0.0/24 (work devices) to use VPN&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;configure

# Create a separate routing table for VPN traffic
set protocols static table 10 route 0.0.0.0/0 interface wg0

# Policy route: match source and set table
set policy route PBR rule 10 source address &apos;10.0.0.0/24&apos;
set policy route PBR rule 10 set table &apos;10&apos;

# Apply policy to LAN interface
set policy route PBR interface &apos;eth1&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Validation&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check routing table 10 exists
show ip route table 10

# From a work device, check public IP
# Should show VPN exit IP, not ISP IP
curl ifconfig.me
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Scenario 2: Route by Destination (Specific Sites Through VPN)&lt;/h2&gt;
&lt;p&gt;Route only certain destinations through VPN, everything else direct.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Routing table for VPN
set protocols static table 20 route 0.0.0.0/0 interface wg0

# Define destinations (IP ranges of services you want through VPN)
set firewall group network-group VPN-DESTINATIONS network &apos;203.0.113.0/24&apos;
set firewall group network-group VPN-DESTINATIONS network &apos;198.51.100.0/24&apos;

# Policy route: match destination and set table
set policy route PBR-DEST rule 10 destination group network-group &apos;VPN-DESTINATIONS&apos;
set policy route PBR-DEST rule 10 set table &apos;20&apos;
set policy route PBR-DEST interface &apos;eth1&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Scenario 3: Route by Domain (Using DNS-Based Groups)&lt;/h2&gt;
&lt;p&gt;VyOS 1.4+ supports domain groups. Traffic to specific domains can be routed differently.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Create domain group
set firewall group domain-group STREAMING domain &apos;netflix.com&apos;
set firewall group domain-group STREAMING domain &apos;nflxvideo.net&apos;
set firewall group domain-group STREAMING domain &apos;hulu.com&apos;

# Route streaming through ISP (not VPN) even if VPN is default
set protocols static table 30 route 0.0.0.0/0 next-hop 192.168.1.1

# Policy route: match domain group and set table
set policy route PBR-DOMAIN rule 10 destination group domain-group &apos;STREAMING&apos;
set policy route PBR-DOMAIN rule 10 set table &apos;30&apos;
set policy route PBR-DOMAIN interface &apos;eth1&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: Domain groups rely on DNS resolution. VyOS maintains a cache of IP addresses for the domains. This isn&apos;t perfect — CDNs change IPs, some services use many domains. But for common use cases, it works well.&lt;/p&gt;
&lt;h2&gt;Scenario 4: Combined Rules (Source + Destination)&lt;/h2&gt;
&lt;p&gt;Real-world setups often need combinations: &quot;Work devices accessing work servers go through VPN, but their general browsing goes direct.&quot;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Groups
set firewall group network-group WORK-DEVICES network &apos;10.0.0.0/24&apos;
set firewall group network-group WORK-SERVERS network &apos;10.100.0.0/16&apos;

# Table for VPN
set protocols static table 40 route 0.0.0.0/0 interface wg0

# Policy route: match source AND destination, set table
set policy route PBR-WORK rule 10 source group network-group &apos;WORK-DEVICES&apos;
set policy route PBR-WORK rule 10 destination group network-group &apos;WORK-SERVERS&apos;
set policy route PBR-WORK rule 10 set table &apos;40&apos;

# Everything else from work devices goes direct (no matching rule = main table)

set policy route PBR-WORK interface &apos;eth1&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Debugging PBR&lt;/h2&gt;
&lt;p&gt;When PBR doesn&apos;t work as expected, debug systematically:&lt;/p&gt;
&lt;h3&gt;1. Verify Policy Route is Matching&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check policy route statistics
show policy route statistics
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If rules show zero packets, the match criteria aren&apos;t hitting. Check the source/destination groups.&lt;/p&gt;
&lt;h3&gt;2. Verify the Routing Table Exists&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;show ip route table 10
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Should show the route (e.g., default via wg0). If empty, the table wasn&apos;t configured correctly.&lt;/p&gt;
&lt;h3&gt;3. Verify Policy Route is Applied&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;show policy route
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Confirms which interfaces have policy routing and what rules exist.&lt;/p&gt;
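&lt;p&gt;Under the hood, VyOS compiles &lt;code&gt;set table&lt;/code&gt; into packet marks plus kernel routing rules. Inspecting that state directly confirms the plumbing exists (diagnostic sketch):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Rules created from the policy route config
sudo ip rule show

# Contents of the alternate table
sudo ip route show table 10
&lt;/code&gt;&lt;/pre&gt;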
&lt;h3&gt;4. Trace a Specific Packet&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# From VyOS, simulate the routing decision for a given source
sudo ip route get 8.8.8.8 from 10.0.0.50 iif eth1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Shows which route the kernel would actually choose for that source, with policy rules applied. (Here 10.0.0.50 stands in for a host your policy should match.)&lt;/p&gt;
&lt;h3&gt;5. Check Actual Traffic Flow&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Monitor traffic on interfaces
sudo tcpdump -i wg0 -n host 8.8.8.8
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If traffic appears on the expected interface, PBR is working.&lt;/p&gt;
&lt;h3&gt;Common Issues&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;Cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Traffic not matching&lt;/td&gt;
&lt;td&gt;Source/dest mismatch&lt;/td&gt;
&lt;td&gt;Verify group contents, check rule order&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Matched but wrong route&lt;/td&gt;
&lt;td&gt;Table number mismatch&lt;/td&gt;
&lt;td&gt;Ensure table exists with correct routes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Works then fails&lt;/td&gt;
&lt;td&gt;Gateway down in alternate table&lt;/td&gt;
&lt;td&gt;Add gateway monitoring&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DNS traffic bypasses PBR&lt;/td&gt;
&lt;td&gt;DNS resolves before routing&lt;/td&gt;
&lt;td&gt;Use domain groups or DNS on VPN&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;Best Practices&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Use meaningful table numbers&lt;/strong&gt;: 10 for VPN, 20 for backup ISP, etc. Document what each table is for.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Keep firewall groups organized&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set firewall group network-group VPN-CLIENTS description &apos;Devices that always use VPN&apos;
set firewall group network-group BYPASS-VPN description &apos;Devices that never use VPN&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Test each component separately&lt;/strong&gt;: First verify the routing table works (manually add a route and test), then verify policy rules are matching, then check traffic flows correctly.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Have a fallback&lt;/strong&gt;: If the VPN goes down, policy-routed traffic will blackhole, because the alternate table&apos;s only route points at a dead interface. Consider adding:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Fallback route in VPN table
set protocols static table 10 route 0.0.0.0/0 next-hop 192.168.1.1 distance 10
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Lower distance = preferred. If wg0 route (default distance 1) fails, traffic falls back to ISP.&lt;/p&gt;
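&lt;p&gt;Drill the failover before relying on it: take the tunnel down deliberately and confirm the fallback route takes over (sketch):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure
set interfaces wireguard wg0 disable
commit

# Table 10 should now resolve via 192.168.1.1
run show ip route table 10

# Restore the tunnel
delete interfaces wireguard wg0 disable
commit
&lt;/code&gt;&lt;/pre&gt;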
&lt;h2&gt;Complete Example: Split-Tunnel VPN&lt;/h2&gt;
&lt;p&gt;Here&apos;s a realistic full configuration. Certain devices always use VPN, streaming services bypass VPN, everything else goes direct.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# === Routing Tables ===
set protocols static table 10 route 0.0.0.0/0 interface wg0

# === Firewall Groups ===
set firewall group network-group VPN-CLIENTS network &apos;10.0.0.50/32&apos;
set firewall group network-group VPN-CLIENTS network &apos;10.0.0.51/32&apos;
set firewall group domain-group STREAMING domain &apos;netflix.com&apos;
set firewall group domain-group STREAMING domain &apos;nflxvideo.net&apos;

# === Policy Route Rules ===
# Rule order matters - exceptions first!

# Streaming from VPN clients goes direct (not through VPN)
set policy route PBR rule 5 source group network-group &apos;VPN-CLIENTS&apos;
set policy route PBR rule 5 destination group domain-group &apos;STREAMING&apos;
# No table set = uses main routing table

# VPN clients use VPN for everything else
set policy route PBR rule 10 source group network-group &apos;VPN-CLIENTS&apos;
set policy route PBR rule 10 set table &apos;10&apos;

# === Apply ===
set policy route PBR interface &apos;eth1&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Rule order matters: rule 5 (streaming exception) is checked before rule 10 (VPN routing). Streaming traffic matches rule 5 with no table override, uses default routing. Everything else from VPN clients matches rule 10, uses VPN table.&lt;/p&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;PBR isn&apos;t magic incantations. It&apos;s two clear steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Define where traffic should go (routing tables)&lt;/li&gt;
&lt;li&gt;Define what traffic to affect (policy route rules with matching criteria)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;When debugging, check each step independently. Are rules matching traffic? Does the table have the right routes? Is the policy applied to the correct interface?&lt;/p&gt;
&lt;p&gt;Clear criteria + systematic debugging = PBR that works reliably.&lt;/p&gt;
</content:encoded><category>vyos</category><category>firewall</category><category>networking</category><category>routing</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>IPv6 at Home: RA, DHCPv6, and Why Your Firewall Keeps Breaking It</title><link>https://ashimov.com/posts/vyos-ipv6/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-ipv6/</guid><description>Practical IPv6 configuration on VyOS for home networks. Covers Router Advertisements, DHCPv6, stateless vs stateful addressing, firewall rules, and debugging ND/RA issues.</description><pubDate>Tue, 07 Oct 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;IPv6 breaks differently than IPv4. There&apos;s no NAT hiding your mistakes, no single DHCP server controlling everything, and a whole new set of protocols (RA, ND, DHCPv6) that must work together. When IPv6 stops working, it&apos;s rarely &quot;magic&quot; — it&apos;s almost always Router Advertisements or firewall rules.&lt;/p&gt;
&lt;p&gt;This guide covers practical IPv6 deployment on VyOS, including the debugging steps that will save you hours of frustration.&lt;/p&gt;
&lt;h2&gt;IPv6 Addressing: What You Actually Get&lt;/h2&gt;
&lt;p&gt;Most ISPs provide one of these:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Static prefix&lt;/strong&gt;: You get a /48 or /56 that doesn&apos;t change&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DHCPv6-PD (Prefix Delegation)&lt;/strong&gt;: Router requests a prefix dynamically&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Single /64&lt;/strong&gt;: Bare minimum, limits your options&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For home use, DHCPv6-PD is most common. Let&apos;s configure for that scenario.&lt;/p&gt;
&lt;h2&gt;WAN Configuration: Getting Your Prefix&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;configure

# Request address via DHCPv6 for WAN interface
set interfaces ethernet eth0 ipv6 address autoconf

# Request delegated prefix for LAN
set interfaces ethernet eth0 dhcpv6-options pd 0 interface eth1 sla-id &apos;0&apos;
set interfaces ethernet eth0 dhcpv6-options pd 0 length &apos;56&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;sla-id&lt;/code&gt; lets you create multiple /64s from your delegated prefix. If you get a /56, you have 256 possible /64 subnets. sla-id &apos;0&apos; means use the first one on eth1.&lt;/p&gt;
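&lt;p&gt;With a /56 delegation you can hand a distinct /64 to each LAN segment by repeating the pd block with a different &lt;code&gt;sla-id&lt;/code&gt; (sketch, assuming a second LAN on a hypothetical eth2):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set interfaces ethernet eth0 dhcpv6-options pd 0 interface eth2 sla-id &apos;1&apos;
&lt;/code&gt;&lt;/pre&gt;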
&lt;p&gt;&lt;strong&gt;Validation step&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;show interfaces ethernet eth0
show interfaces ethernet eth1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;eth0 should have a global IPv6 address. eth1 should have an address from your delegated prefix (something like 2001:db8:1234::/64 depending on your ISP).&lt;/p&gt;
&lt;h2&gt;LAN Configuration: Router Advertisements&lt;/h2&gt;
&lt;p&gt;Unlike IPv4, where DHCP does everything, IPv6 clients learn about the network through Router Advertisements (RA). The router periodically multicasts &quot;I exist, here&apos;s the prefix, here&apos;s how to get addresses.&quot;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Enable router advertisements on LAN
set service router-advert interface eth1 prefix ::/64
set service router-advert interface eth1 name-server 2606:4700:4700::1111
set service router-advert interface eth1 name-server 2001:4860:4860::8888

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;::/64&lt;/code&gt; prefix means &quot;use whatever prefix is assigned to this interface.&quot; VyOS automatically advertises the correct prefix.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Validation step&lt;/strong&gt;: On a LAN client, check for IPv6 address.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Linux
ip -6 addr show

# macOS
ifconfig | grep inet6

# Windows
ipconfig | findstr IPv6
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Clients should have a global IPv6 address (not just fe80:: link-local).&lt;/p&gt;
&lt;h2&gt;SLAAC vs DHCPv6: Understanding the Options&lt;/h2&gt;
&lt;p&gt;Two ways for clients to get IPv6 addresses:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;SLAAC (Stateless Address Autoconfiguration)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Client generates its own address from the prefix&lt;/li&gt;
&lt;li&gt;No server tracks who has what address&lt;/li&gt;
&lt;li&gt;Simple, but no central lease database&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;DHCPv6 (Stateful)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Server assigns specific addresses&lt;/li&gt;
&lt;li&gt;Tracks leases like IPv4 DHCP&lt;/li&gt;
&lt;li&gt;More control, more complexity&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For home networks, SLAAC is usually sufficient. The RA configuration above uses SLAAC by default.&lt;/p&gt;
&lt;p&gt;If you need DHCPv6 (for address tracking or specific assignments):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Tell clients to also use DHCPv6 for addresses
set service router-advert interface eth1 managed-flag

# DHCPv6 server
set service dhcpv6-server shared-network-name LAN subnet 2001:db8:1234::/64 range 0 start 2001:db8:1234::100
set service dhcpv6-server shared-network-name LAN subnet 2001:db8:1234::/64 range 0 stop 2001:db8:1234::1ff
set service dhcpv6-server shared-network-name LAN subnet 2001:db8:1234::/64 subnet-id 1

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note: Replace &lt;code&gt;2001:db8:1234::/64&lt;/code&gt; with your actual delegated prefix.&lt;/p&gt;
&lt;h2&gt;The Firewall Problem&lt;/h2&gt;
&lt;p&gt;Here&apos;s where most IPv6 setups break. IPv4 NAT accidentally provided security — nothing could reach internal hosts without explicit port forwards. IPv6 has no NAT (normally), so every device is directly addressable from the internet.&lt;/p&gt;
&lt;p&gt;You MUST have proper firewall rules.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# IPv6 firewall: WAN to LAN
set firewall ipv6 name WANv6-TO-LANv6 default-action &apos;drop&apos;
set firewall ipv6 name WANv6-TO-LANv6 rule 10 action &apos;accept&apos;
set firewall ipv6 name WANv6-TO-LANv6 rule 10 state &apos;established&apos;
set firewall ipv6 name WANv6-TO-LANv6 rule 10 state &apos;related&apos;

# ICMPv6 is REQUIRED for IPv6 to function
set firewall ipv6 name WANv6-TO-LANv6 rule 20 action &apos;accept&apos;
set firewall ipv6 name WANv6-TO-LANv6 rule 20 protocol &apos;ipv6-icmp&apos;

# IPv6 firewall: WAN to router (local)
set firewall ipv6 name WANv6-LOCAL default-action &apos;drop&apos;
set firewall ipv6 name WANv6-LOCAL rule 10 action &apos;accept&apos;
set firewall ipv6 name WANv6-LOCAL rule 10 state &apos;established&apos;
set firewall ipv6 name WANv6-LOCAL rule 10 state &apos;related&apos;
set firewall ipv6 name WANv6-LOCAL rule 20 action &apos;accept&apos;
set firewall ipv6 name WANv6-LOCAL rule 20 protocol &apos;ipv6-icmp&apos;
set firewall ipv6 name WANv6-LOCAL rule 30 action &apos;accept&apos;
set firewall ipv6 name WANv6-LOCAL rule 30 protocol &apos;udp&apos;
set firewall ipv6 name WANv6-LOCAL rule 30 destination port &apos;546&apos;
set firewall ipv6 name WANv6-LOCAL rule 30 source port &apos;547&apos;

# LAN to WAN: allow outbound
set firewall ipv6 name LANv6-TO-WANv6 default-action &apos;accept&apos;

# Apply to forward/input chains
set firewall ipv6 forward filter default-action &apos;accept&apos;
set firewall ipv6 forward filter rule 10 inbound-interface name &apos;eth0&apos;
set firewall ipv6 forward filter rule 10 action &apos;jump&apos;
set firewall ipv6 forward filter rule 10 jump-target &apos;WANv6-TO-LANv6&apos;

set firewall ipv6 input filter default-action &apos;drop&apos;
set firewall ipv6 input filter rule 10 inbound-interface name &apos;eth0&apos;
set firewall ipv6 input filter rule 10 action &apos;jump&apos;
set firewall ipv6 input filter rule 10 jump-target &apos;WANv6-LOCAL&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Critical: ICMPv6 Must Be Allowed&lt;/h3&gt;
&lt;p&gt;Unlike IPv4 where you could (unwisely) block all ICMP, IPv6 &lt;em&gt;requires&lt;/em&gt; ICMPv6 for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Neighbor Discovery (ND)&lt;/strong&gt;: IPv6&apos;s replacement for ARP&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Router Advertisements&lt;/strong&gt;: How clients find the gateway&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Path MTU Discovery&lt;/strong&gt;: Essential for connectivity&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Duplicate Address Detection&lt;/strong&gt;: Prevents IP conflicts&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Blocking ICMPv6 = broken IPv6. Rule 20 in the firewall above allows all ICMPv6. You can be more restrictive:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# More restrictive ICMPv6 (still functional)
set firewall ipv6 name WANv6-TO-LANv6 rule 20 icmpv6 type &apos;echo-request&apos;
set firewall ipv6 name WANv6-TO-LANv6 rule 21 action &apos;accept&apos;
set firewall ipv6 name WANv6-TO-LANv6 rule 21 protocol &apos;ipv6-icmp&apos;
set firewall ipv6 name WANv6-TO-LANv6 rule 21 icmpv6 type &apos;destination-unreachable&apos;
set firewall ipv6 name WANv6-TO-LANv6 rule 22 action &apos;accept&apos;
set firewall ipv6 name WANv6-TO-LANv6 rule 22 protocol &apos;ipv6-icmp&apos;
set firewall ipv6 name WANv6-TO-LANv6 rule 22 icmpv6 type &apos;packet-too-big&apos;
set firewall ipv6 name WANv6-TO-LANv6 rule 23 action &apos;accept&apos;
set firewall ipv6 name WANv6-TO-LANv6 rule 23 protocol &apos;ipv6-icmp&apos;
set firewall ipv6 name WANv6-TO-LANv6 rule 23 icmpv6 type &apos;time-exceeded&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;NAT66: When You Actually Need It&lt;/h2&gt;
&lt;p&gt;Pure IPv6 doesn&apos;t need NAT. But sometimes you&apos;re stuck:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;ISP only gives you a /64 and you need multiple subnets&lt;/li&gt;
&lt;li&gt;Privacy concerns about exposing internal addressing&lt;/li&gt;
&lt;li&gt;Translating between different IPv6 ranges&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;NAT66 (IPv6-to-IPv6 NAT) is available but should be a last resort:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# NAT66 - use only if absolutely necessary
set nat66 source rule 100 outbound-interface name &apos;eth0&apos;
set nat66 source rule 100 source prefix &apos;fd00::/64&apos;
set nat66 source rule 100 translation address &apos;masquerade&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This would NAT your ULA (fd00::/64) internal addresses to your public prefix. Again, avoid this if possible — it defeats IPv6&apos;s end-to-end connectivity benefits.&lt;/p&gt;
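&lt;p&gt;For completeness: the rule above assumes the LAN actually uses ULA addressing internally. A minimal sketch of that side (prefix and address illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Give the LAN a ULA prefix alongside (or instead of) the delegated one
set interfaces ethernet eth1 address &apos;fd00::1/64&apos;
set service router-advert interface eth1 prefix fd00::/64

commit
&lt;/code&gt;&lt;/pre&gt;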
&lt;h2&gt;Debugging IPv6 Issues&lt;/h2&gt;
&lt;p&gt;When IPv6 breaks, here&apos;s the diagnostic flow:&lt;/p&gt;
&lt;h3&gt;1. Check Interface Addressing&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;show interfaces
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Both WAN and LAN need global IPv6 addresses (not just fe80:: link-local).&lt;/p&gt;
&lt;h3&gt;2. Verify Router Advertisements&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# On VyOS
show ipv6 route

# On Linux client
rdisc6 eth0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If RA isn&apos;t working, clients won&apos;t get addresses or know the default gateway.&lt;/p&gt;
&lt;h3&gt;3. Check Neighbor Discovery&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# On VyOS
show ipv6 neighbors

# On Linux client
ip -6 neigh show
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;ND is like ARP for IPv6. Missing entries mean L2 connectivity issues or firewall blocking.&lt;/p&gt;
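&lt;p&gt;To watch ND and RA traffic directly, a capture sketch on a Linux client (interface name illustrative; the &lt;code&gt;ip6[40]&lt;/code&gt; byte match assumes no extension headers, which is the normal case for ND):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Router Advertisements (ICMPv6 type 134)
sudo tcpdump -i eth0 -vv &apos;icmp6 and ip6[40] == 134&apos;

# Neighbor Solicitation/Advertisement (types 135/136)
sudo tcpdump -i eth0 &apos;icmp6 and (ip6[40] == 135 or ip6[40] == 136)&apos;
&lt;/code&gt;&lt;/pre&gt;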
&lt;h3&gt;4. Test Connectivity Layer by Layer&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# From VyOS: can we reach the internet?
ping 2600:: count 3

# From client: can we reach the gateway?
# (substitute the router&apos;s actual link-local, shown by &apos;ip -6 route show default&apos;)
ping6 fe80::1%eth0

# From client: can we reach the internet?
ping6 google.com
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;5. Check Firewall Counters&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;show firewall ipv6 name WANv6-TO-LANv6
show firewall ipv6 name WANv6-LOCAL
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;High drop counts on specific rules indicate what&apos;s being blocked.&lt;/p&gt;
&lt;h3&gt;Common Issues and Fixes&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Likely Cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;No global address on LAN client&lt;/td&gt;
&lt;td&gt;RA not working&lt;/td&gt;
&lt;td&gt;Check router-advert config, verify eth1 has global address&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Can ping gateway but not internet&lt;/td&gt;
&lt;td&gt;Missing default route or firewall&lt;/td&gt;
&lt;td&gt;Check &lt;code&gt;show ipv6 route&lt;/code&gt;, verify firewall allows outbound&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Intermittent connectivity&lt;/td&gt;
&lt;td&gt;ICMPv6 blocked&lt;/td&gt;
&lt;td&gt;Allow ICMPv6 in firewall&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Works then stops after minutes&lt;/td&gt;
&lt;td&gt;DAD failure or RA timeout&lt;/td&gt;
&lt;td&gt;Check RA interval, look for duplicate addresses&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DHCPv6 not assigning addresses&lt;/td&gt;
&lt;td&gt;Missing managed-flag in RA&lt;/td&gt;
&lt;td&gt;Set &lt;code&gt;managed-flag&lt;/code&gt; on router-advert interface&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;Privacy Extensions&lt;/h2&gt;
&lt;p&gt;By default, SLAAC creates addresses based on MAC address — potentially trackable. Modern systems use Privacy Extensions (RFC 4941) to generate random addresses.&lt;/p&gt;
&lt;p&gt;VyOS can control this via RA:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Suggest clients use privacy addresses
set service router-advert interface eth1 prefix ::/64 preferred-lifetime &apos;14400&apos;
set service router-advert interface eth1 prefix ::/64 valid-lifetime &apos;86400&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Shorter lifetimes encourage address rotation. Client OS controls whether to actually use privacy extensions.&lt;/p&gt;
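&lt;p&gt;Whether temporary addresses are actually used is a client-side setting. On Linux, a sketch (interface name illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# 2 = generate temporary addresses and prefer them for outgoing traffic
sudo sysctl -w net.ipv6.conf.eth0.use_tempaddr=2

# Persist across reboots
echo &apos;net.ipv6.conf.eth0.use_tempaddr = 2&apos; | sudo tee /etc/sysctl.d/90-ipv6-privacy.conf
&lt;/code&gt;&lt;/pre&gt;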
&lt;h2&gt;Complete IPv6 Configuration&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;# WAN: DHCPv6-PD
set interfaces ethernet eth0 ipv6 address autoconf
set interfaces ethernet eth0 dhcpv6-options pd 0 interface eth1 sla-id &apos;0&apos;
set interfaces ethernet eth0 dhcpv6-options pd 0 length &apos;56&apos;

# LAN: Router Advertisements
set service router-advert interface eth1 prefix ::/64
set service router-advert interface eth1 name-server 2606:4700:4700::1111
set service router-advert interface eth1 name-server 2001:4860:4860::8888

# Firewall: WAN inbound
set firewall ipv6 name WANv6-TO-LANv6 default-action &apos;drop&apos;
set firewall ipv6 name WANv6-TO-LANv6 rule 10 action &apos;accept&apos;
set firewall ipv6 name WANv6-TO-LANv6 rule 10 state &apos;established&apos;
set firewall ipv6 name WANv6-TO-LANv6 rule 10 state &apos;related&apos;
set firewall ipv6 name WANv6-TO-LANv6 rule 20 action &apos;accept&apos;
set firewall ipv6 name WANv6-TO-LANv6 rule 20 protocol &apos;ipv6-icmp&apos;

# Firewall: WAN to local
set firewall ipv6 name WANv6-LOCAL default-action &apos;drop&apos;
set firewall ipv6 name WANv6-LOCAL rule 10 action &apos;accept&apos;
set firewall ipv6 name WANv6-LOCAL rule 10 state &apos;established&apos;
set firewall ipv6 name WANv6-LOCAL rule 10 state &apos;related&apos;
set firewall ipv6 name WANv6-LOCAL rule 20 action &apos;accept&apos;
set firewall ipv6 name WANv6-LOCAL rule 20 protocol &apos;ipv6-icmp&apos;
set firewall ipv6 name WANv6-LOCAL rule 30 action &apos;accept&apos;
set firewall ipv6 name WANv6-LOCAL rule 30 protocol &apos;udp&apos;
set firewall ipv6 name WANv6-LOCAL rule 30 destination port &apos;546&apos;
set firewall ipv6 name WANv6-LOCAL rule 30 source port &apos;547&apos;

# Firewall: LAN outbound
set firewall ipv6 name LANv6-TO-WANv6 default-action &apos;accept&apos;

# Apply firewall to forward/input chains
set firewall ipv6 forward filter default-action &apos;accept&apos;
set firewall ipv6 forward filter rule 10 inbound-interface name &apos;eth0&apos;
set firewall ipv6 forward filter rule 10 action &apos;jump&apos;
set firewall ipv6 forward filter rule 10 jump-target &apos;WANv6-TO-LANv6&apos;

set firewall ipv6 input filter default-action &apos;drop&apos;
set firewall ipv6 input filter rule 10 inbound-interface name &apos;eth0&apos;
set firewall ipv6 input filter rule 10 action &apos;jump&apos;
set firewall ipv6 input filter rule 10 jump-target &apos;WANv6-LOCAL&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;IPv6 doesn&apos;t break mysteriously. When it fails, check in order:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;RA configuration&lt;/strong&gt;: Is the router advertising the prefix?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Firewall rules&lt;/strong&gt;: Is ICMPv6 allowed? Is DHCPv6 (port 546/547) allowed for WAN-local?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Prefix delegation&lt;/strong&gt;: Did the router actually receive a prefix from the ISP?&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Once you understand that RA replaces much of what DHCP does in IPv4, and that ICMPv6 is mandatory (not optional), IPv6 becomes predictable. The debugging is different, but the methodology is the same: verify each layer, check what&apos;s being blocked, and read the counters.&lt;/p&gt;
</content:encoded><category>vyos</category><category>firewall</category><category>homelab</category><category>networking</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>VyOS Isn&apos;t Scary: Building Your First Production-Ready Router</title><link>https://ashimov.com/posts/vyos-basics/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-basics/</guid><description>A practical guide to setting up VyOS from scratch. Covers WAN/LAN configuration, NAT, DHCP, DNS forwarding, and basic firewall rules with validation at every step.</description><pubDate>Fri, 03 Oct 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;VyOS has a reputation for being intimidating. The CLI-only interface, the commit/save model, the sheer number of configuration options — it can feel overwhelming. But here&apos;s the thing: VyOS isn&apos;t complicated, it&apos;s just explicit. Every setting you&apos;d configure through a consumer router&apos;s web UI exists here too, just visible and version-controllable.&lt;/p&gt;
&lt;p&gt;This guide walks through building a basic but production-ready router configuration. We&apos;ll validate each piece before moving to the next. By the end, you&apos;ll have a working router and the confidence to extend it.&lt;/p&gt;
&lt;h2&gt;The Mental Model&lt;/h2&gt;
&lt;p&gt;Before touching any commands, understand VyOS&apos;s configuration model:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Configuration tree&lt;/strong&gt;: Settings are organized hierarchically (like a filesystem)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Edit sessions&lt;/strong&gt;: Changes are staged, then committed atomically&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rollback capability&lt;/strong&gt;: Bad commit? Roll back instantly&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Show vs Configure mode&lt;/strong&gt;: &lt;code&gt;show&lt;/code&gt; reads running state, &lt;code&gt;configure&lt;/code&gt; modifies it&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;# Enter configuration mode
configure

# Make changes (staged, not active yet)
set interfaces ethernet eth0 address 192.168.1.1/24

# See what would change
compare

# Apply changes atomically
commit

# Persist across reboots
save

# Exit configuration mode
exit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This model prevents half-applied configurations. Either everything commits successfully, or nothing changes.&lt;/p&gt;
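&lt;p&gt;The same model protects you from locking yourself out when configuring remotely. A sketch using &lt;code&gt;commit-confirm&lt;/code&gt;: the commit reverts automatically unless confirmed within the given number of minutes.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure
set interfaces ethernet eth0 address 192.168.1.1/24

# Apply, but auto-revert in 10 minutes unless confirmed
commit-confirm 10

# Still have access? Make the change permanent:
run confirm
save
&lt;/code&gt;&lt;/pre&gt;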
&lt;h2&gt;Initial Setup: Interfaces&lt;/h2&gt;
&lt;p&gt;Let&apos;s assume a typical setup:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;eth0&lt;/code&gt;: WAN (gets address via DHCP from ISP)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;eth1&lt;/code&gt;: LAN (our internal network, 10.0.0.0/24)&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;configure

# WAN interface - DHCP from ISP
set interfaces ethernet eth0 description &apos;WAN&apos;
set interfaces ethernet eth0 address dhcp

# LAN interface - static address, this router is the gateway
set interfaces ethernet eth1 description &apos;LAN&apos;
set interfaces ethernet eth1 address &apos;10.0.0.1/24&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Validation step&lt;/strong&gt;: Check interfaces are up and addressed correctly.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;show interfaces
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You should see &lt;code&gt;eth0&lt;/code&gt; with an address from your ISP and &lt;code&gt;eth1&lt;/code&gt; with &lt;code&gt;10.0.0.1/24&lt;/code&gt;. If &lt;code&gt;eth0&lt;/code&gt; shows no address, check cable and ISP connectivity.&lt;/p&gt;
&lt;h2&gt;NAT: Making LAN Traffic Reach the Internet&lt;/h2&gt;
&lt;p&gt;Without NAT, your LAN devices can&apos;t reach the internet — their private IPs aren&apos;t routable. We need source NAT (masquerade) on outbound traffic.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Source NAT for outbound traffic
set nat source rule 100 outbound-interface name &apos;eth0&apos;
set nat source rule 100 source address &apos;10.0.0.0/24&apos;
set nat source rule 100 translation address &apos;masquerade&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Validation step&lt;/strong&gt;: From a LAN device with manual IP (10.0.0.100, gateway 10.0.0.1), try pinging 8.8.8.8.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On VyOS, check NAT is working
show nat source statistics
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If pings fail, verify:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;LAN device has correct gateway (10.0.0.1)&lt;/li&gt;
&lt;li&gt;VyOS can ping 8.8.8.8 itself (routing works)&lt;/li&gt;
&lt;li&gt;NAT rule matches your LAN subnet&lt;/li&gt;
&lt;/ol&gt;
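&lt;p&gt;You can also watch the translation happen. &lt;code&gt;monitor traffic&lt;/code&gt; wraps tcpdump on VyOS; while a LAN client pings 8.8.8.8, outbound packets on eth0 should carry the WAN address as source:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;monitor traffic interface eth0 filter &apos;icmp and host 8.8.8.8&apos;

# Inspect the active translation table
show nat source translations
&lt;/code&gt;&lt;/pre&gt;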
&lt;h2&gt;DHCP: Automatic Addressing for LAN Clients&lt;/h2&gt;
&lt;p&gt;Manual IPs work for testing, but clients need DHCP.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# DHCP server for LAN
set service dhcp-server shared-network-name LAN subnet 10.0.0.0/24 subnet-id 1
set service dhcp-server shared-network-name LAN subnet 10.0.0.0/24 range 0 start &apos;10.0.0.100&apos;
set service dhcp-server shared-network-name LAN subnet 10.0.0.0/24 range 0 stop &apos;10.0.0.254&apos;
set service dhcp-server shared-network-name LAN subnet 10.0.0.0/24 default-router &apos;10.0.0.1&apos;
set service dhcp-server shared-network-name LAN subnet 10.0.0.0/24 name-server &apos;10.0.0.1&apos;
set service dhcp-server shared-network-name LAN subnet 10.0.0.0/24 lease &apos;86400&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Validation step&lt;/strong&gt;: Release/renew DHCP on a LAN client, verify it gets an address in the 10.0.0.100-254 range.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;show dhcp server leases
&lt;/code&gt;&lt;/pre&gt;
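&lt;p&gt;Devices that should always get the same address (a NAS, a printer) can use a static mapping instead of a pool lease. A sketch with illustrative name, MAC, and IP; the exact option names vary slightly between VyOS releases:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Always hand 10.0.0.10 to this MAC (VyOS 1.4-style syntax)
set service dhcp-server shared-network-name LAN subnet 10.0.0.0/24 static-mapping nas mac-address &apos;52:54:00:12:34:56&apos;
set service dhcp-server shared-network-name LAN subnet 10.0.0.0/24 static-mapping nas ip-address &apos;10.0.0.10&apos;

commit
&lt;/code&gt;&lt;/pre&gt;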
&lt;h2&gt;DNS Forwarding: Local Resolution&lt;/h2&gt;
&lt;p&gt;Clients point to 10.0.0.1 for DNS. VyOS needs to forward those queries upstream.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# DNS forwarding
set service dns forwarding cache-size &apos;1000&apos;
set service dns forwarding listen-address &apos;10.0.0.1&apos;
set service dns forwarding allow-from &apos;10.0.0.0/24&apos;

# Use ISP&apos;s DNS or public resolvers
set service dns forwarding name-server &apos;1.1.1.1&apos;
set service dns forwarding name-server &apos;8.8.8.8&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Validation step&lt;/strong&gt;: From LAN client, resolve a domain.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On VyOS
show dns forwarding statistics
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;At this point, LAN clients should have full internet access with automatic addressing. Test by browsing from a client device.&lt;/p&gt;
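&lt;p&gt;The forwarder can also answer for local names. VyOS serves its static host mappings through the DNS forwarder, so LAN devices become reachable by name (hostname and address here are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Local name for a LAN host, resolvable by any client using 10.0.0.1 for DNS
set system static-host-mapping host-name nas.lan inet &apos;10.0.0.10&apos;

commit
&lt;/code&gt;&lt;/pre&gt;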
&lt;h2&gt;Firewall: The Foundation&lt;/h2&gt;
&lt;p&gt;The VyOS firewall mental model:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Traffic is classified by direction: &lt;code&gt;forward&lt;/code&gt; (through the router), &lt;code&gt;input&lt;/code&gt; (to the router itself), &lt;code&gt;output&lt;/code&gt; (from the router)&lt;/li&gt;
&lt;li&gt;Named rule sets describe what is allowed between interface pairs (WAN to LAN, LAN to router)&lt;/li&gt;
&lt;li&gt;Default policy should be drop; allow only what is explicit&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Modern VyOS (1.4+) attaches these named rule sets to the forward and input chains with jump rules (a separate zone-based mode also exists). Here&apos;s a clean baseline:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Create firewall groups for organization
set firewall group network-group LAN-NETS network &apos;10.0.0.0/24&apos;

# WAN to LAN: only established/related connections (no inbound initiation)
set firewall ipv4 name WAN-TO-LAN default-action &apos;drop&apos;
set firewall ipv4 name WAN-TO-LAN rule 10 action &apos;accept&apos;
set firewall ipv4 name WAN-TO-LAN rule 10 state &apos;established&apos;
set firewall ipv4 name WAN-TO-LAN rule 10 state &apos;related&apos;

# WAN to router itself (local): very restrictive
set firewall ipv4 name WAN-LOCAL default-action &apos;drop&apos;
set firewall ipv4 name WAN-LOCAL rule 10 action &apos;accept&apos;
set firewall ipv4 name WAN-LOCAL rule 10 state &apos;established&apos;
set firewall ipv4 name WAN-LOCAL rule 10 state &apos;related&apos;

# LAN to WAN: allow all outbound (NAT handles the rest)
set firewall ipv4 name LAN-TO-WAN default-action &apos;accept&apos;

# LAN to router: allow DHCP, DNS, SSH
set firewall ipv4 name LAN-LOCAL default-action &apos;drop&apos;
set firewall ipv4 name LAN-LOCAL rule 10 action &apos;accept&apos;
set firewall ipv4 name LAN-LOCAL rule 10 state &apos;established&apos;
set firewall ipv4 name LAN-LOCAL rule 10 state &apos;related&apos;
set firewall ipv4 name LAN-LOCAL rule 20 action &apos;accept&apos;
set firewall ipv4 name LAN-LOCAL rule 20 protocol &apos;udp&apos;
set firewall ipv4 name LAN-LOCAL rule 20 destination port &apos;67,68&apos;
set firewall ipv4 name LAN-LOCAL rule 30 action &apos;accept&apos;
set firewall ipv4 name LAN-LOCAL rule 30 protocol &apos;udp&apos;
set firewall ipv4 name LAN-LOCAL rule 30 destination port &apos;53&apos;
set firewall ipv4 name LAN-LOCAL rule 40 action &apos;accept&apos;
set firewall ipv4 name LAN-LOCAL rule 40 protocol &apos;tcp&apos;
set firewall ipv4 name LAN-LOCAL rule 40 destination port &apos;53&apos;
set firewall ipv4 name LAN-LOCAL rule 50 action &apos;accept&apos;
set firewall ipv4 name LAN-LOCAL rule 50 protocol &apos;tcp&apos;
set firewall ipv4 name LAN-LOCAL rule 50 destination port &apos;22&apos;
set firewall ipv4 name LAN-LOCAL rule 50 source group network-group &apos;LAN-NETS&apos;

# Apply firewall to forward chain (traffic passing through router)
set firewall ipv4 forward filter default-action &apos;accept&apos;
set firewall ipv4 forward filter rule 10 inbound-interface name &apos;eth0&apos;
set firewall ipv4 forward filter rule 10 action &apos;jump&apos;
set firewall ipv4 forward filter rule 10 jump-target &apos;WAN-TO-LAN&apos;

# Apply firewall to input chain (traffic TO the router)
set firewall ipv4 input filter default-action &apos;drop&apos;
set firewall ipv4 input filter rule 10 inbound-interface name &apos;eth0&apos;
set firewall ipv4 input filter rule 10 action &apos;jump&apos;
set firewall ipv4 input filter rule 10 jump-target &apos;WAN-LOCAL&apos;
set firewall ipv4 input filter rule 20 inbound-interface name &apos;eth1&apos;
set firewall ipv4 input filter rule 20 action &apos;jump&apos;
set firewall ipv4 input filter rule 20 jump-target &apos;LAN-LOCAL&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Validation step&lt;/strong&gt;: Test each service still works after firewall is applied.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check firewall rule hit counts
show firewall

# From LAN: verify DNS, DHCP, internet access
# From WAN: verify no response to unsolicited connections
&lt;/code&gt;&lt;/pre&gt;
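&lt;p&gt;When you eventually need to expose a service, two pieces are required: a destination NAT rule and a matching accept in the WAN-TO-LAN chain (the firewall sees post-DNAT addresses). A sketch with illustrative addresses and ports:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Forward WAN TCP 8443 to an internal web server
set nat destination rule 10 inbound-interface name &apos;eth0&apos;
set nat destination rule 10 protocol &apos;tcp&apos;
set nat destination rule 10 destination port &apos;8443&apos;
set nat destination rule 10 translation address &apos;10.0.0.80&apos;
set nat destination rule 10 translation port &apos;443&apos;

# Accept the translated traffic in the forward path
set firewall ipv4 name WAN-TO-LAN rule 20 action &apos;accept&apos;
set firewall ipv4 name WAN-TO-LAN rule 20 protocol &apos;tcp&apos;
set firewall ipv4 name WAN-TO-LAN rule 20 destination address &apos;10.0.0.80&apos;
set firewall ipv4 name WAN-TO-LAN rule 20 destination port &apos;443&apos;

commit
&lt;/code&gt;&lt;/pre&gt;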
&lt;h2&gt;Systematic Validation Checklist&lt;/h2&gt;
&lt;p&gt;Before calling this &quot;done&quot;, verify each component:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Test&lt;/th&gt;
&lt;th&gt;Expected Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;WAN connectivity&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ping 8.8.8.8&lt;/code&gt; from VyOS&lt;/td&gt;
&lt;td&gt;Success&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DNS on VyOS&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ping google.com&lt;/code&gt; from VyOS&lt;/td&gt;
&lt;td&gt;Resolves and pings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LAN addressing&lt;/td&gt;
&lt;td&gt;&lt;code&gt;show interfaces&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;eth1 has 10.0.0.1/24&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DHCP&lt;/td&gt;
&lt;td&gt;Client gets address&lt;/td&gt;
&lt;td&gt;IP in 10.0.0.100-254&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NAT&lt;/td&gt;
&lt;td&gt;Client pings 8.8.8.8&lt;/td&gt;
&lt;td&gt;Success&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DNS forwarding&lt;/td&gt;
&lt;td&gt;Client resolves domains&lt;/td&gt;
&lt;td&gt;Success&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Firewall (LAN→WAN)&lt;/td&gt;
&lt;td&gt;Client browses internet&lt;/td&gt;
&lt;td&gt;Success&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Firewall (WAN→LAN)&lt;/td&gt;
&lt;td&gt;External port scan&lt;/td&gt;
&lt;td&gt;All filtered&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;The Complete Configuration&lt;/h2&gt;
&lt;p&gt;Here&apos;s everything in one block for reference:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Interfaces
set interfaces ethernet eth0 description &apos;WAN&apos;
set interfaces ethernet eth0 address dhcp
set interfaces ethernet eth1 description &apos;LAN&apos;
set interfaces ethernet eth1 address &apos;10.0.0.1/24&apos;

# NAT
set nat source rule 100 outbound-interface name &apos;eth0&apos;
set nat source rule 100 source address &apos;10.0.0.0/24&apos;
set nat source rule 100 translation address &apos;masquerade&apos;

# DHCP
set service dhcp-server shared-network-name LAN subnet 10.0.0.0/24 subnet-id 1
set service dhcp-server shared-network-name LAN subnet 10.0.0.0/24 range 0 start &apos;10.0.0.100&apos;
set service dhcp-server shared-network-name LAN subnet 10.0.0.0/24 range 0 stop &apos;10.0.0.254&apos;
set service dhcp-server shared-network-name LAN subnet 10.0.0.0/24 default-router &apos;10.0.0.1&apos;
set service dhcp-server shared-network-name LAN subnet 10.0.0.0/24 name-server &apos;10.0.0.1&apos;
set service dhcp-server shared-network-name LAN subnet 10.0.0.0/24 lease &apos;86400&apos;

# DNS
set service dns forwarding cache-size &apos;1000&apos;
set service dns forwarding listen-address &apos;10.0.0.1&apos;
set service dns forwarding allow-from &apos;10.0.0.0/24&apos;
set service dns forwarding name-server &apos;1.1.1.1&apos;
set service dns forwarding name-server &apos;8.8.8.8&apos;

# Firewall (see detailed rules above)
# ... firewall rules ...

# System basics
set system host-name &apos;router&apos;
set system name-server &apos;1.1.1.1&apos;
set system time-zone &apos;UTC&apos;

# SSH access (LAN only via firewall)
set service ssh port &apos;22&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;What&apos;s Next&lt;/h2&gt;
&lt;p&gt;This configuration handles basic routing, but there&apos;s more to explore:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;IPv6&lt;/strong&gt;: Modern networks should support it (covered in the next article)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;VLANs&lt;/strong&gt;: Segment your network further&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;VPN&lt;/strong&gt;: WireGuard or IPsec for remote access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monitoring&lt;/strong&gt;: Logs and metrics for visibility&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The key lesson from this exercise: &lt;strong&gt;build the foundation first, validate each piece, then extend&lt;/strong&gt;. VyOS rewards methodical configuration. When something breaks later, you&apos;ll know exactly which commit introduced the problem.&lt;/p&gt;
&lt;p&gt;Save your configuration, export it (&lt;code&gt;show configuration commands &amp;gt; config.txt&lt;/code&gt;), and version control it. Your router is now reproducible.&lt;/p&gt;
</content:encoded><category>vyos</category><category>firewall</category><category>homelab</category><category>networking</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>GPU / PCI Passthrough: The Path That Works (and What Breaks It)</title><link>https://ashimov.com/posts/proxmox-gpu-passthrough/</link><guid isPermaLink="true">https://ashimov.com/posts/proxmox-gpu-passthrough/</guid><description>Complete guide to GPU and PCI passthrough on Proxmox. Covers IOMMU setup, ACS override, VFIO configuration, driver binding, common issues, and why passthrough is hardware compatibility plus attention to detail.</description><pubDate>Tue, 30 Sep 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;GPU passthrough lets a VM directly access a physical GPU. No emulation, no virtualization overhead — the VM sees real hardware and gets real performance. It&apos;s the only way to run GPU workloads (gaming, machine learning, transcoding) in VMs without massive performance loss.&lt;/p&gt;
&lt;p&gt;It&apos;s also one of the most finicky things to configure. Hardware compatibility, IOMMU groups, driver issues — any of these can break passthrough completely. When it works, it&apos;s magical. When it doesn&apos;t, debugging is painful.&lt;/p&gt;
&lt;p&gt;Passthrough is hardware compatibility plus attention to detail.&lt;/p&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;h3&gt;Hardware Requirements&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;CPU:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Intel: VT-d (IOMMU support)&lt;/li&gt;
&lt;li&gt;AMD: AMD-Vi (IOMMU support)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Check BIOS for &quot;VT-d,&quot; &quot;AMD-Vi,&quot; &quot;IOMMU,&quot; or &quot;Virtualization Technology for Directed I/O.&quot;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Motherboard:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Must support IOMMU&lt;/li&gt;
&lt;li&gt;Consumer boards often have poor IOMMU groups&lt;/li&gt;
&lt;li&gt;Server/workstation boards usually work better&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;GPU:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Most discrete GPUs work&lt;/li&gt;
&lt;li&gt;NVIDIA consumer cards historically hit &quot;Code 43&quot; in VMs (we&apos;ll address this below)&lt;/li&gt;
&lt;li&gt;AMD cards generally work well&lt;/li&gt;
&lt;li&gt;Intel integrated graphics: limited support&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Check IOMMU Support&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Intel
dmesg | grep -e DMAR -e IOMMU

# AMD
dmesg | grep AMD-Vi

# Should see messages like:
# DMAR: IOMMU enabled
# AMD-Vi: Enabling IOMMU
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If no messages, enable IOMMU in BIOS.&lt;/p&gt;
&lt;h2&gt;Enable IOMMU&lt;/h2&gt;
&lt;h3&gt;Edit GRUB Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Edit GRUB config
nano /etc/default/grub
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For Intel:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;GRUB_CMDLINE_LINUX_DEFAULT=&quot;quiet intel_iommu=on iommu=pt&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For AMD:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;GRUB_CMDLINE_LINUX_DEFAULT=&quot;quiet amd_iommu=on iommu=pt&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Apply changes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;update-grub   # on ZFS / systemd-boot installs use: proxmox-boot-tool refresh
reboot
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Verify IOMMU Enabled&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;dmesg | grep -e DMAR -e IOMMU -e AMD-Vi

# Should see:
# DMAR: IOMMU enabled
# or
# AMD-Vi: AMD IOMMUv2 loaded
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;IOMMU Groups&lt;/h2&gt;
&lt;p&gt;IOMMU groups are sets of devices that must be passed through together. You can&apos;t pass a single device if it&apos;s in a group with other devices.&lt;/p&gt;
&lt;h3&gt;View IOMMU Groups&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# Save as /root/iommu-groups.sh
shopt -s nullglob
for g in $(find /sys/kernel/iommu_groups/* -maxdepth 0 -type d | sort -V); do
    echo &quot;IOMMU Group ${g##*/}:&quot;
    for d in $g/devices/*; do
        echo -e &quot;\t$(lspci -nns ${d##*/})&quot;
    done
done
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Example output:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;IOMMU Group 1:
    00:01.0 PCI bridge [0604]: Intel Corporation...
    01:00.0 VGA compatible controller [0300]: NVIDIA Corporation... [10de:2204]
    01:00.1 Audio device [0403]: NVIDIA Corporation... [10de:1aef]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The GPU (01:00.0) and its audio device (01:00.1) are in the same group. You must pass both.&lt;/p&gt;
&lt;h3&gt;ACS Override (If Needed)&lt;/h3&gt;
&lt;p&gt;Poor IOMMU grouping (everything lumped into one group) can sometimes be worked around with the ACS override patch, which the Proxmox kernel ships with. &lt;strong&gt;Use with caution&lt;/strong&gt; — it tells the kernel to treat devices as isolated even when the hardware doesn&apos;t guarantee it.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Add to GRUB
GRUB_CMDLINE_LINUX_DEFAULT=&quot;quiet intel_iommu=on iommu=pt pcie_acs_override=downstream,multifunction&quot;

update-grub
reboot
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After reboot, check groups again. They should be smaller.&lt;/p&gt;
&lt;h2&gt;VFIO Configuration&lt;/h2&gt;
&lt;p&gt;VFIO (Virtual Function I/O) binds devices for passthrough.&lt;/p&gt;
&lt;h3&gt;Identify Device IDs&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;lspci -nn | grep -i nvidia
# 01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA102 [GeForce RTX 3090] [10de:2204]
# 01:00.1 Audio device [0403]: NVIDIA Corporation GA102 High Definition Audio Controller [10de:1aef]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Device IDs: &lt;code&gt;10de:2204&lt;/code&gt; (GPU), &lt;code&gt;10de:1aef&lt;/code&gt; (Audio)&lt;/p&gt;
&lt;h3&gt;Configure VFIO&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Add VFIO modules
echo &quot;vfio&quot; &amp;gt;&amp;gt; /etc/modules
echo &quot;vfio_iommu_type1&quot; &amp;gt;&amp;gt; /etc/modules
echo &quot;vfio_pci&quot; &amp;gt;&amp;gt; /etc/modules
# vfio_virqfd was merged into the core vfio module in kernel 6.2+;
# skip this line on Proxmox 8 and later
echo &quot;vfio_virqfd&quot; &amp;gt;&amp;gt; /etc/modules

# Bind devices to VFIO (use YOUR device IDs)
echo &quot;options vfio-pci ids=10de:2204,10de:1aef disable_vga=1&quot; &amp;gt; /etc/modprobe.d/vfio.conf

# Blacklist host drivers (so host doesn&apos;t grab GPU)
echo &quot;blacklist nouveau&quot; &amp;gt;&amp;gt; /etc/modprobe.d/blacklist.conf
echo &quot;blacklist nvidia&quot; &amp;gt;&amp;gt; /etc/modprobe.d/blacklist.conf
echo &quot;blacklist nvidia_drm&quot; &amp;gt;&amp;gt; /etc/modprobe.d/blacklist.conf
echo &quot;blacklist nvidiafb&quot; &amp;gt;&amp;gt; /etc/modprobe.d/blacklist.conf

# For AMD
echo &quot;blacklist amdgpu&quot; &amp;gt;&amp;gt; /etc/modprobe.d/blacklist.conf
echo &quot;blacklist radeon&quot; &amp;gt;&amp;gt; /etc/modprobe.d/blacklist.conf

# Update initramfs
update-initramfs -u

# Reboot
reboot
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Verify VFIO Binding&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;lspci -nnk -s 01:00
# Should show:
# Kernel driver in use: vfio-pci
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If it shows nvidia or nouveau, the blacklist didn&apos;t work. Check modprobe configuration.&lt;/p&gt;
&lt;h2&gt;Create VM with GPU Passthrough&lt;/h2&gt;
&lt;h3&gt;VM Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Create VM
qm create 100 --name gpu-vm --memory 16384 --cores 8 --sockets 1 \
  --bios ovmf --machine q35 \
  --net0 virtio,bridge=vmbr0 \
  --scsihw virtio-scsi-pci

# Add EFI disk
qm set 100 --efidisk0 local-zfs:1,format=raw

# Add main disk
qm set 100 --scsi0 local-zfs:100,ssd=1

# CPU settings (important for passthrough)
qm set 100 --cpu host,hidden=1,flags=+pcid

# Add the GPU; omitting the function digit (01:00) passes through all
# functions of the device, including the audio controller at 01:00.1
qm set 100 --hostpci0 01:00,pcie=1,x-vga=1

# Or pass the functions individually:
# qm set 100 --hostpci0 01:00.0,pcie=1,x-vga=1
# qm set 100 --hostpci1 01:00.1,pcie=1
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Important Settings&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;BIOS: OVMF (UEFI)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Required for modern GPUs&lt;/li&gt;
&lt;li&gt;Enables PCI passthrough features&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Machine: q35&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Modern chipset with proper PCIe support&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;CPU: host,hidden=1&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;host&lt;/code&gt;: Pass through all CPU features&lt;/li&gt;
&lt;li&gt;&lt;code&gt;hidden=1&lt;/code&gt;: Hide hypervisor from VM (needed for NVIDIA)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;hostpci: pcie=1,x-vga=1&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;pcie=1&lt;/code&gt;: Use PCIe mode (required)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;x-vga=1&lt;/code&gt;: Primary graphics device&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;NVIDIA-Specific Fixes&lt;/h2&gt;
&lt;p&gt;Older NVIDIA drivers detect virtualization and refuse to initialize (Windows reports &quot;Code 43&quot; in Device Manager). Several workarounds:&lt;/p&gt;
&lt;h3&gt;Hide Hypervisor&lt;/h3&gt;
&lt;p&gt;Already done with &lt;code&gt;cpu: host,hidden=1&lt;/code&gt;. For older NVIDIA drivers, you may also need:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# In /etc/pve/qemu-server/100.conf, add to args (if Error 43 persists):
args: -cpu &apos;host,hv_vendor_id=NV43FIX,+kvm_pv_unhalt,+kvm_pv_eoi&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note: Modern NVIDIA drivers (535+) usually work with just &lt;code&gt;hidden=1&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;Vendor ID Spoofing&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Add to VM config
qm set 100 --args &quot;-cpu &apos;host,hv_vendor_id=randomid&apos;&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;ROM File (Sometimes Needed)&lt;/h3&gt;
&lt;p&gt;Some GPUs need their VBIOS dumped and passed:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Dump ROM (from another system or Windows)
# Or download from TechPowerUp

# Add to VM config
qm set 100 --hostpci0 01:00,pcie=1,x-vga=1,romfile=gpu.rom
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Place ROM file in &lt;code&gt;/usr/share/kvm/&lt;/code&gt;.&lt;/p&gt;
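&lt;p&gt;If the GPU is not the host&apos;s boot display device, the VBIOS can often be dumped from the running host via sysfs (the PCI address 01:00.0 comes from the example above):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Enable ROM access, copy it out, then disable access again
cd /sys/bus/pci/devices/0000:01:00.0
echo 1 &amp;gt; rom
cat rom &amp;gt; /usr/share/kvm/gpu.rom
echo 0 &amp;gt; rom
&lt;/code&gt;&lt;/pre&gt;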
&lt;h2&gt;AMD GPU Passthrough&lt;/h2&gt;
&lt;p&gt;AMD is generally easier:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# VFIO config (use AMD device IDs)
echo &quot;options vfio-pci ids=1002:xxxx,1002:xxxx&quot; &amp;gt; /etc/modprobe.d/vfio.conf

# Blacklist
echo &quot;blacklist amdgpu&quot; &amp;gt;&amp;gt; /etc/modprobe.d/blacklist.conf
echo &quot;blacklist radeon&quot; &amp;gt;&amp;gt; /etc/modprobe.d/blacklist.conf

update-initramfs -u
reboot
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;AMD GPUs usually work without additional tweaks.&lt;/p&gt;
&lt;h3&gt;AMD Reset Bug&lt;/h3&gt;
&lt;p&gt;Some AMD GPUs (Polaris, Navi) have reset bugs — VM shutdown leaves GPU in bad state, requiring host reboot.&lt;/p&gt;
&lt;p&gt;Workaround:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Install build dependencies
apt install pve-headers-$(uname -r) git build-essential dkms

# Clone and build vendor-reset module
git clone https://github.com/gnif/vendor-reset.git
cd vendor-reset
dkms install .

# Verify module loads
modprobe vendor-reset

# Make persistent after successful test
echo &quot;vendor-reset&quot; &amp;gt;&amp;gt; /etc/modules
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Other PCI Devices&lt;/h2&gt;
&lt;p&gt;Passthrough works for any PCI device, not just GPUs:&lt;/p&gt;
&lt;h3&gt;USB Controller&lt;/h3&gt;
&lt;p&gt;Pass entire USB controller for low-latency USB:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Find USB controller
lspci | grep USB

# Pass through
qm set 100 --hostpci2 00:14.0,pcie=1
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;NVMe Controller&lt;/h3&gt;
&lt;p&gt;Pass NVMe for direct storage access:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Find NVMe
lspci | grep NVMe

# Pass through
qm set 100 --hostpci3 03:00.0,pcie=1
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Network Card&lt;/h3&gt;
&lt;p&gt;Pass dedicated NIC:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;qm set 100 --hostpci4 04:00.0,pcie=1
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Troubleshooting&lt;/h2&gt;
&lt;h3&gt;Device Not Bound to VFIO&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check current driver
lspci -nnk -s 01:00

# If not vfio-pci:
# 1. Check blacklist
cat /etc/modprobe.d/blacklist.conf

# 2. Check VFIO config
cat /etc/modprobe.d/vfio.conf

# 3. Rebuild initramfs
update-initramfs -u -k all

# 4. Reboot
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;VM Won&apos;t Start&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check QEMU log
cat /var/log/pve/qemu-server/100.log

# Common issues:
# - IOMMU not enabled
# - Device in use by host
# - Wrong device ID
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;No Display Output&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Verify x-vga=1 is set
grep hostpci /etc/pve/qemu-server/100.conf

# Try different video output
# Monitor on GPU should show VM boot

# Check if VM is actually running
qm status 100
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;NVIDIA Code 43&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Verify hidden flag
grep cpu /etc/pve/qemu-server/100.conf
# Should include hidden=1

# Try vendor ID spoof
# Add to args: hv_vendor_id=NV43FIX

# Ensure BIOS is OVMF, not SeaBIOS
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Poor Performance&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Inside VM:
# Check if GPU is using correct driver
nvidia-smi  # Should show GPU

# Check PCIe link speed
lspci -vv -s 01:00 | grep -i width
# Should show x16 or at least x8

# Ensure cpu type is &apos;host&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Single GPU Passthrough&lt;/h2&gt;
&lt;p&gt;Using your only GPU in a VM (no display on host):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Host boots headless
# VM gets GPU
# Reconnect display to VM

# Challenges:
# - Host has no display
# - Must manage via SSH/remote
# - GPU must unbind from host console

# Scripts needed for:
# 1. Unbind GPU from host
# 2. Start VM
# 3. Stop VM
# 4. Rebind GPU to host
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is complex. Search for &quot;single GPU passthrough scripts&quot; for examples.&lt;/p&gt;
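&lt;p&gt;As a starting point, here is a sketch of such a hook script (Proxmox can run one via &lt;code&gt;qm set 100 --hookscript local:snippets/gpu-hook.sh&lt;/code&gt;). The sysfs paths are illustrative assumptions for a typical EFI host, so the actual unbind commands are left commented:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# Sketch: /var/lib/vz/snippets/gpu-hook.sh
# Proxmox invokes hookscripts as: script &amp;lt;vmid&amp;gt; &amp;lt;phase&amp;gt;
gpu_hook() {
    local vmid=$1 phase=$2
    case $phase in
        pre-start)
            echo VM $vmid: releasing GPU from host
            # echo 0 &amp;gt; /sys/class/vtconsole/vtcon0/bind    # detach virtual console (vtcon0 assumed)
            # echo efi-framebuffer.0 &amp;gt; /sys/bus/platform/drivers/efi-framebuffer/unbind
            ;;
        post-stop)
            echo VM $vmid: returning GPU to host
            # echo 1 &amp;gt; /sys/class/vtconsole/vtcon0/bind
            ;;
    esac
}
gpu_hook $1 $2
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The hard part is finding the right unbind sequence for your hardware; the structure above just shows where it goes.&lt;/p&gt;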
&lt;h2&gt;Live Migration Limitations&lt;/h2&gt;
&lt;p&gt;VMs with passthrough &lt;strong&gt;cannot live migrate&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Hardware is physically on one host&lt;/li&gt;
&lt;li&gt;Must stop VM, move, start on new host&lt;/li&gt;
&lt;li&gt;Not compatible with HA auto-failover&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Plan accordingly: critical passthrough VMs can&apos;t be HA.&lt;/p&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Passthrough is hardware compatibility plus attention to detail.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When passthrough doesn&apos;t work, it&apos;s usually:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;IOMMU not enabled&lt;/strong&gt;: Check BIOS and kernel parameters&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bad IOMMU groups&lt;/strong&gt;: ACS override or different hardware&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Driver conflict&lt;/strong&gt;: Host driver grabs device before VFIO&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;NVIDIA detection&lt;/strong&gt;: Hidden flags and vendor ID spoof&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reset bugs&lt;/strong&gt;: AMD GPUs need vendor-reset module&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The debugging process:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Verify IOMMU enabled (&lt;code&gt;dmesg | grep IOMMU&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Check IOMMU groups (all needed devices in one group)&lt;/li&gt;
&lt;li&gt;Verify VFIO binding (&lt;code&gt;lspci -nnk&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Check VM logs (&lt;code&gt;/var/log/pve/qemu-server/&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Try minimal config, add features one at a time&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Passthrough isn&apos;t guaranteed to work with all hardware. Some motherboards have terrible IOMMU groups. Some GPUs have bugs. Do research before buying hardware specifically for passthrough.&lt;/p&gt;
&lt;p&gt;When it works, you get bare-metal GPU performance in a VM. When it doesn&apos;t, you need patience and systematic debugging. There&apos;s no magic fix — just working through each requirement methodically.&lt;/p&gt;
</content:encoded><category>proxmox</category><category>passthrough</category><category>virtualization</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Performance Clinic: CPU Pinning, Hugepages, VirtIO, and Storage Tuning</title><link>https://ashimov.com/posts/proxmox-performance/</link><guid isPermaLink="true">https://ashimov.com/posts/proxmox-performance/</guid><description>Proxmox performance optimization guide. Covers VirtIO drivers, cache modes, IO threads, NUMA awareness, hugepages, and why optimization starts with measurement, not tweaking.</description><pubDate>Fri, 26 Sep 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Performance tuning is seductive. Forums are full of &quot;enable this setting for 20% more speed.&quot; Most of it is cargo culting — copying settings without understanding why.&lt;/p&gt;
&lt;p&gt;Real performance optimization follows a process: measure, identify bottleneck, address bottleneck, measure again. Tweaking random settings without measuring is just superstition.&lt;/p&gt;
&lt;p&gt;Optimization starts with measurement, not with tweaks.&lt;/p&gt;
&lt;h2&gt;Measure First&lt;/h2&gt;
&lt;p&gt;Before changing anything, understand your current performance.&lt;/p&gt;
&lt;h3&gt;Host Metrics&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Overall system performance
htop

# CPU usage per core
mpstat -P ALL 1

# Memory usage
free -h
vmstat 1

# Disk I/O
iostat -xz 1

# Network
iftop -i vmbr0
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;VM Performance&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Inside VM: Check for virtualization overhead
# CPU steal time (other VMs taking your CPU)
top  # Look at %st column

# Disk latency
iostat -x 1

# From host: VM-specific metrics
qm monitor 100
info cpus
info block
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Benchmark Tools&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# CPU benchmark
apt install sysbench
sysbench cpu run

# Disk benchmark
apt install fio

# Random 4K (database-like)
fio --name=rand --ioengine=libaio --iodepth=32 --rw=randread --bs=4k --direct=1 --size=1G --numjobs=4 --runtime=30 --group_reporting

# Sequential (large file)
fio --name=seq --ioengine=libaio --iodepth=1 --rw=read --bs=1m --direct=1 --size=4G --numjobs=1 --runtime=30 --group_reporting

# Network benchmark (between VMs)
apt install iperf3
# Server: iperf3 -s
# Client: iperf3 -c &amp;lt;server-ip&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;VirtIO Drivers&lt;/h2&gt;
&lt;p&gt;VirtIO is paravirtualized I/O. Instead of emulating real hardware, the VM knows it&apos;s virtualized and uses optimized drivers.&lt;/p&gt;
&lt;h3&gt;Performance Impact&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Device&lt;/th&gt;
&lt;th&gt;Emulated&lt;/th&gt;
&lt;th&gt;VirtIO&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Network&lt;/td&gt;
&lt;td&gt;E1000: ~1 Gbps&lt;/td&gt;
&lt;td&gt;virtio-net: 10+ Gbps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disk&lt;/td&gt;
&lt;td&gt;IDE: slow, high CPU&lt;/td&gt;
&lt;td&gt;virtio-blk: fast, low CPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Display&lt;/td&gt;
&lt;td&gt;VGA: basic&lt;/td&gt;
&lt;td&gt;virtio-gpu: better&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Configuring VirtIO&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Disk: Use virtio-scsi controller
qm set 100 --scsihw virtio-scsi-pci
qm set 100 --scsi0 local-zfs:vm-100-disk-0

# Network: Use virtio
qm set 100 --net0 virtio,bridge=vmbr0

# Display: Use virtio (Linux VMs)
qm set 100 --vga virtio
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Windows VirtIO Drivers&lt;/h3&gt;
&lt;p&gt;Windows doesn&apos;t include VirtIO drivers. Install them:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Download ISO from Fedora: https://fedorapeople.org/groups/virt/virtio-win/direct-downloads/&lt;/li&gt;
&lt;li&gt;Attach ISO to VM&lt;/li&gt;
&lt;li&gt;During Windows install: Load driver from ISO&lt;/li&gt;
&lt;li&gt;After install: Run &lt;code&gt;virtio-win-gt-x64.msi&lt;/code&gt; for guest tools&lt;/li&gt;
&lt;/ol&gt;
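&lt;p&gt;Step 2 from the CLI might look like this (the ISO filename depends on the version you downloaded):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Attach the VirtIO driver ISO as a second CD-ROM drive
qm set 100 --ide2 local:iso/virtio-win.iso,media=cdrom
&lt;/code&gt;&lt;/pre&gt;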
&lt;h2&gt;Storage Cache Modes&lt;/h2&gt;
&lt;p&gt;Cache mode affects performance vs. data safety:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;th&gt;Safety&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;td&gt;Safe&lt;/td&gt;
&lt;td&gt;Production (default)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeback&lt;/td&gt;
&lt;td&gt;Fastest&lt;/td&gt;
&lt;td&gt;Less safe&lt;/td&gt;
&lt;td&gt;Benchmarks, non-critical&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writethrough&lt;/td&gt;
&lt;td&gt;Slower&lt;/td&gt;
&lt;td&gt;Safest&lt;/td&gt;
&lt;td&gt;Critical data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;directsync&lt;/td&gt;
&lt;td&gt;Slowest&lt;/td&gt;
&lt;td&gt;Safest&lt;/td&gt;
&lt;td&gt;Maximum safety&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Configure Cache&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# No cache (recommended for production)
qm set 100 --scsi0 local-zfs:vm-100-disk-0,cache=none

# Writeback (faster, less safe)
qm set 100 --scsi0 local-zfs:vm-100-disk-0,cache=writeback

# With ZFS, cache=none is usually best
# ZFS has its own caching (ARC)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;When to Use Writeback&lt;/h3&gt;
&lt;p&gt;Only with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Battery-backed write cache (enterprise storage)&lt;/li&gt;
&lt;li&gt;Non-critical VMs (dev, test)&lt;/li&gt;
&lt;li&gt;Understanding that power loss = potential data loss&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;IO Threads&lt;/h2&gt;
&lt;p&gt;By default, all VM disk I/O goes through one QEMU thread. With IO threads, each disk gets its own thread.&lt;/p&gt;
&lt;h3&gt;Enable IO Threads&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Enable iothread for disk
qm set 100 --scsi0 local-zfs:vm-100-disk-0,iothread=1

# For multiple disks, each can have its own thread
qm set 100 --scsi1 local-zfs:vm-100-disk-1,iothread=1
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;When IO Threads Help&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Multiple disks per VM&lt;/li&gt;
&lt;li&gt;High IOPS workloads&lt;/li&gt;
&lt;li&gt;VMs with concurrent disk access&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;CPU Configuration&lt;/h2&gt;
&lt;h3&gt;CPU Type&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Host passthrough (best performance, limits migration)
qm set 100 --cpu host

# Specific type (allows migration between similar CPUs)
qm set 100 --cpu kvm64

# With flags (enable specific features)
qm set 100 --cpu host,flags=+aes
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;host&lt;/strong&gt; gives best performance but limits live migration to identical CPUs.&lt;/p&gt;
&lt;h3&gt;NUMA Awareness&lt;/h3&gt;
&lt;p&gt;NUMA (Non-Uniform Memory Access) matters on multi-socket systems. Memory attached to one socket is faster for CPUs on that socket.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check host NUMA topology
numactl --hardware

# Example output:
# node 0 cpus: 0 1 2 3 4 5 6 7
# node 1 cpus: 8 9 10 11 12 13 14 15
# node 0 size: 32768 MB
# node 1 size: 32768 MB
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Configure NUMA for VMs&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Enable NUMA for VM
qm set 100 --numa 1

# Pin VM to specific NUMA node
qm set 100 --numa0 cpus=0-3,memory=8192

# For large VMs spanning nodes
qm set 100 --numa0 cpus=0-3,memory=4096
qm set 100 --numa1 cpus=8-11,memory=4096
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;CPU Pinning&lt;/h3&gt;
&lt;p&gt;Dedicate specific CPUs to a VM (reduces context switching):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Pin VM to CPUs 0-3
qm set 100 --affinity 0-3

# Or via NUMA config
qm set 100 --numa0 cpus=0-3,memory=8192
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Caution&lt;/strong&gt;: Over-pinning leaves other VMs fighting for remaining CPUs.&lt;/p&gt;
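&lt;p&gt;To confirm pinning took effect, check the affinity of the VM&apos;s QEMU process (Proxmox writes its PID to /var/run/qemu-server/&amp;lt;vmid&amp;gt;.pid):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Show which CPUs the QEMU process for VM 100 may run on
taskset -cp $(cat /var/run/qemu-server/100.pid)
&lt;/code&gt;&lt;/pre&gt;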
&lt;h2&gt;Hugepages&lt;/h2&gt;
&lt;p&gt;Normal memory pages are 4KB. Hugepages (2MB or 1GB) reduce TLB misses for memory-intensive workloads.&lt;/p&gt;
&lt;h3&gt;Enable Hugepages&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Reserve hugepages on host
echo 4096 &amp;gt; /proc/sys/vm/nr_hugepages  # 4096 × 2MB = 8GB

# Make persistent
echo &quot;vm.nr_hugepages = 4096&quot; &amp;gt;&amp;gt; /etc/sysctl.conf

# Verify
grep Huge /proc/meminfo
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Configure VM for Hugepages&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Enable hugepages for VM
qm set 100 --hugepages 2

# Values: 2 (2MB pages), 1024 (1GB pages), any (auto)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;When Hugepages Help&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Large VMs (8GB+ RAM)&lt;/li&gt;
&lt;li&gt;Memory-intensive workloads (databases)&lt;/li&gt;
&lt;li&gt;Many VMs with significant memory&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Memory Ballooning&lt;/h2&gt;
&lt;p&gt;Balloon driver lets host reclaim unused VM memory.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Enable ballooning
qm set 100 --balloon 2048  # Minimum memory
qm set 100 --memory 8192   # Maximum memory

# VM starts with 8GB, can shrink to 2GB if host needs RAM
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Ballooning Trade-offs&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Pro: Better memory utilization across VMs&lt;/li&gt;
&lt;li&gt;Con: Performance impact when balloon inflates&lt;/li&gt;
&lt;li&gt;Con: Swap inside VM if balloon too aggressive&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For latency-sensitive VMs, disable ballooning:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;qm set 100 --balloon 0
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Network Performance&lt;/h2&gt;
&lt;h3&gt;Multiqueue&lt;/h3&gt;
&lt;p&gt;Enable multiple queues for virtio-net:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Enable multiqueue (match to VM vCPUs, max 8)
qm set 100 --net0 virtio,bridge=vmbr0,queues=4
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Inside VM:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Set queues on interface
ethtool -L eth0 combined 4
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Vhost-net&lt;/h3&gt;
&lt;p&gt;Offload network processing to kernel (usually enabled by default):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Verify vhost-net is loaded
lsmod | grep vhost_net

# If not loaded
modprobe vhost_net
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Storage Performance&lt;/h2&gt;
&lt;h3&gt;ZFS Tuning&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check ARC size
arc_summary | grep &quot;ARC size&quot;

# Increase ARC max (if you have RAM)
echo &quot;options zfs zfs_arc_max=8589934592&quot; &amp;gt; /etc/modprobe.d/zfs.conf  # 8GB
# Takes effect after update-initramfs -u and a reboot; for an immediate
# change, write the value to /sys/module/zfs/parameters/zfs_arc_max

# For SSDs, a shorter transaction group timeout can lower write latency
# (the default is already 5 seconds; measure before and after changing it)
echo 2 &amp;gt; /sys/module/zfs/parameters/zfs_txg_timeout
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;LVM-thin Tuning&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check thin pool status
lvs -o+data_percent

# Zeroing (disable for SSD, faster provisioning)
lvchange --zero n pve/data
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Ceph Tuning&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check pool settings
ceph osd pool get vmpool all

# Increase pg_num if needed
ceph osd pool set vmpool pg_num 256

# Adjust recovery (if impacting production)
ceph config set osd osd_recovery_max_active 1
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Common Bottlenecks&lt;/h2&gt;
&lt;h3&gt;CPU Bottleneck&lt;/h3&gt;
&lt;p&gt;Symptoms: High CPU usage, steal time in VMs&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check host CPU
mpstat -P ALL 1

# Check VM steal time
top  # %st column

# Solutions:
# - Reduce VM count
# - Pin VMs to specific CPUs
# - Upgrade host CPU
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Memory Bottleneck&lt;/h3&gt;
&lt;p&gt;Symptoms: Swapping, OOM, balloon activity&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check host memory
free -h
grep -E &quot;MemTotal|MemFree|Buffers|Cached|SwapTotal|SwapFree&quot; /proc/meminfo

# Check ZFS ARC (consuming RAM)
arc_summary | head -20

# Solutions:
# - Reduce ZFS ARC max
# - Reduce VM memory
# - Add more host RAM
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Storage Bottleneck&lt;/h3&gt;
&lt;p&gt;Symptoms: High I/O wait, slow disk operations&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check disk latency
iostat -x 1

# Look for:
# - await &amp;gt; 10ms (spinning disk) or &amp;gt; 1ms (SSD)
# - %util &amp;gt; 80%

# Solutions:
# - Move to faster storage
# - Enable IO threads
# - Reduce concurrent I/O (fewer VMs)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Network Bottleneck&lt;/h3&gt;
&lt;p&gt;Symptoms: Low throughput, high latency&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check interface utilization
iftop -i vmbr0

# Check for errors
ip -s link show vmbr0

# Solutions:
# - Enable virtio multiqueue
# - Bond multiple NICs
# - Upgrade to faster network
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Performance Testing Workflow&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Baseline&lt;/strong&gt;: Measure current performance&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Identify&lt;/strong&gt;: Find the bottleneck (CPU, RAM, disk, network)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Change&lt;/strong&gt;: Make ONE change&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Measure&lt;/strong&gt;: Test the same workload&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compare&lt;/strong&gt;: Did it improve?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Iterate&lt;/strong&gt;: Repeat until satisfied&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Never change multiple things at once. You won&apos;t know what helped.&lt;/p&gt;
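&lt;p&gt;Step 5 is plain arithmetic: compare the same metric before and after the change. A tiny shell sketch with illustrative fio IOPS numbers:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Percent change between two fio runs (integer math; numbers are examples)
baseline=52000   # IOPS before the change
after=61000      # IOPS after the change
delta=$(( (after - baseline) * 100 / baseline ))
echo delta: ${delta}%
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If the delta is marginal or negative, revert and try the next candidate.&lt;/p&gt;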
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Optimization starts with measurement, not with tweaks.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Random performance settings from forums:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Might help your workload&lt;/li&gt;
&lt;li&gt;Might hurt your workload&lt;/li&gt;
&lt;li&gt;Might do nothing&lt;/li&gt;
&lt;li&gt;You won&apos;t know which without measuring&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The process:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Measure baseline&lt;/li&gt;
&lt;li&gt;Identify bottleneck&lt;/li&gt;
&lt;li&gt;Research solutions for THAT bottleneck&lt;/li&gt;
&lt;li&gt;Apply change&lt;/li&gt;
&lt;li&gt;Measure again&lt;/li&gt;
&lt;li&gt;Keep or revert&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Performance tuning isn&apos;t about knowing magic settings. It&apos;s about understanding your workload, measuring it, and systematically removing bottlenecks.&lt;/p&gt;
&lt;p&gt;The best optimization is often avoiding the problem: use VirtIO, use SSDs, have enough RAM. The tweaks come after the fundamentals are right.&lt;/p&gt;
</content:encoded><category>proxmox</category><category>tuning</category><category>virtualization</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Observability: Metrics, Logs, Alerts — What I Monitor on Proxmox</title><link>https://ashimov.com/posts/proxmox-observability/</link><guid isPermaLink="true">https://ashimov.com/posts/proxmox-observability/</guid><description>Complete Proxmox monitoring setup. Covers node metrics, storage health, ZFS/Ceph monitoring, log aggregation, alerting rules, and why you cannot manage what you cannot see.</description><pubDate>Tue, 23 Sep 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The Proxmox web UI shows current state. It doesn&apos;t show trends. It doesn&apos;t show &quot;disk was filling up for weeks before it failed.&quot; It doesn&apos;t wake you up at 3 AM when something is about to break.&lt;/p&gt;
&lt;p&gt;Observability means knowing what&apos;s happening before users tell you. Metrics show trends. Logs show context. Alerts notify you before failures become outages.&lt;/p&gt;
&lt;p&gt;You can&apos;t manage what you can&apos;t see. And the Proxmox UI isn&apos;t enough to see.&lt;/p&gt;
&lt;h2&gt;What to Monitor&lt;/h2&gt;
&lt;h3&gt;Host Metrics&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;th&gt;Alert Threshold&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CPU usage&lt;/td&gt;
&lt;td&gt;Overloaded host&lt;/td&gt;
&lt;td&gt;&amp;gt;90% for 5 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory usage&lt;/td&gt;
&lt;td&gt;OOM risk&lt;/td&gt;
&lt;td&gt;&amp;gt;85%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Load average&lt;/td&gt;
&lt;td&gt;System stress&lt;/td&gt;
&lt;td&gt;&amp;gt;cores×2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disk I/O&lt;/td&gt;
&lt;td&gt;Storage bottleneck&lt;/td&gt;
&lt;td&gt;Latency &amp;gt;50ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Network I/O&lt;/td&gt;
&lt;td&gt;Bandwidth saturation&lt;/td&gt;
&lt;td&gt;&amp;gt;80% capacity&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
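&lt;p&gt;As a sketch, the CPU and memory thresholds above translate into Prometheus alert rules like this (metric names come from node_exporter):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# /etc/prometheus/rules/hosts.yml
groups:
  - name: proxmox-hosts
    rules:
      - alert: HostCpuHigh
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=&quot;idle&quot;}[5m])) * 100) &amp;gt; 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: &quot;CPU on {{ $labels.instance }} above 90% for 5 minutes&quot;

      - alert: HostMemoryHigh
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 &amp;gt; 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: &quot;Memory on {{ $labels.instance }} above 85%&quot;
&lt;/code&gt;&lt;/pre&gt;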
&lt;h3&gt;Storage Metrics&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;th&gt;Alert Threshold&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Disk space&lt;/td&gt;
&lt;td&gt;Running out&lt;/td&gt;
&lt;td&gt;&amp;gt;80%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ZFS pool health&lt;/td&gt;
&lt;td&gt;Data integrity&lt;/td&gt;
&lt;td&gt;Any non-ONLINE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ZFS ARC hit rate&lt;/td&gt;
&lt;td&gt;Cache efficiency&lt;/td&gt;
&lt;td&gt;Below 80%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ceph health&lt;/td&gt;
&lt;td&gt;Cluster state&lt;/td&gt;
&lt;td&gt;Any non-HEALTH_OK&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SMART status&lt;/td&gt;
&lt;td&gt;Disk failure prediction&lt;/td&gt;
&lt;td&gt;Any warning&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;VM Metrics&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;th&gt;Alert Threshold&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;VM count&lt;/td&gt;
&lt;td&gt;Capacity planning&lt;/td&gt;
&lt;td&gt;Depends&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Running vs stopped&lt;/td&gt;
&lt;td&gt;Unexpected states&lt;/td&gt;
&lt;td&gt;Any unexpected&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU steal time&lt;/td&gt;
&lt;td&gt;Overcommit&lt;/td&gt;
&lt;td&gt;&amp;gt;10%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Balloon memory&lt;/td&gt;
&lt;td&gt;Memory pressure&lt;/td&gt;
&lt;td&gt;Significant deflation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;Prometheus + Grafana Setup&lt;/h2&gt;
&lt;p&gt;The standard stack: Prometheus scrapes metrics, Grafana visualizes.&lt;/p&gt;
&lt;h3&gt;Install on Separate VM&lt;/h3&gt;
&lt;p&gt;Don&apos;t monitor Proxmox from Proxmox. If the host dies, monitoring dies.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On monitoring VM - install Prometheus
apt update
apt install -y prometheus prometheus-node-exporter

# Add Grafana repository
apt install -y apt-transport-https software-properties-common
wget -q -O /usr/share/keyrings/grafana.key https://apt.grafana.com/gpg.key
echo &quot;deb [signed-by=/usr/share/keyrings/grafana.key] https://apt.grafana.com stable main&quot; | tee /etc/apt/sources.list.d/grafana.list
apt update
apt install -y grafana

systemctl enable --now prometheus prometheus-node-exporter grafana-server
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Proxmox PVE Exporter&lt;/h3&gt;
&lt;p&gt;Prometheus exporter specifically for Proxmox:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Install
pip install prometheus-pve-exporter

# Create config
cat &amp;gt; /etc/pve-exporter.yml &amp;lt;&amp;lt; &apos;EOF&apos;
default:
  user: monitoring@pve
  token_name: prometheus
  token_value: &quot;xxxx-xxxx-xxxx&quot;
  verify_ssl: false
EOF

# Create systemd service
cat &amp;gt; /etc/systemd/system/pve-exporter.service &amp;lt;&amp;lt; &apos;EOF&apos;
[Unit]
Description=Prometheus PVE Exporter
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/pve_exporter /etc/pve-exporter.yml
Restart=always

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now pve-exporter
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Node Exporter on Proxmox Hosts&lt;/h3&gt;
&lt;p&gt;Install on each Proxmox node:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apt install prometheus-node-exporter
systemctl enable --now prometheus-node-exporter
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Prometheus Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  # Prometheus self
  - job_name: &apos;prometheus&apos;
    static_configs:
      - targets: [&apos;localhost:9090&apos;]

  # Node exporters on Proxmox hosts
  - job_name: &apos;proxmox-nodes&apos;
    static_configs:
      - targets:
          - &apos;pve1:9100&apos;
          - &apos;pve2:9100&apos;
          - &apos;pve3:9100&apos;

  # PVE exporter
  - job_name: &apos;proxmox-pve&apos;
    static_configs:
      - targets:
          - &apos;localhost:9221&apos;
    metrics_path: /pve
    params:
      module: [default]
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Grafana Dashboards&lt;/h3&gt;
&lt;p&gt;Import community dashboards:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Proxmox VE&lt;/strong&gt;: Dashboard ID 10347&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Node Exporter Full&lt;/strong&gt;: Dashboard ID 1860&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ZFS&lt;/strong&gt;: Dashboard ID 11337&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ceph&lt;/strong&gt;: Dashboard ID 2842&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Or create custom dashboards for your specific needs.&lt;/p&gt;
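&lt;p&gt;Imports can also be scripted, which keeps a rebuilt Grafana reproducible. A hedged sketch (host and token are placeholders; dashboards with datasource inputs may need a populated &quot;inputs&quot; array):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Download a community dashboard by ID (1860 = Node Exporter Full)
curl -s https://grafana.com/api/dashboards/1860/revisions/latest/download -o dash.json

# Import it via the Grafana HTTP API
curl -s -X POST http://grafana:3000/api/dashboards/import \
  -H &quot;Authorization: Bearer $GRAFANA_TOKEN&quot; \
  -H &quot;Content-Type: application/json&quot; \
  -d &quot;{\&quot;dashboard\&quot;: $(cat dash.json), \&quot;overwrite\&quot;: true, \&quot;inputs\&quot;: []}&quot;
&lt;/code&gt;&lt;/pre&gt;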
&lt;h2&gt;ZFS Monitoring&lt;/h2&gt;
&lt;h3&gt;ZFS Exporter&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Install
pip install prometheus-zfs-exporter

# Run
zfs_exporter --port 9134
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Key ZFS Metrics&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Prometheus queries

# Pool capacity
zfs_pool_allocated_bytes / zfs_pool_size_bytes * 100

# ARC hit rate
rate(zfs_arc_hits_total[5m]) / (rate(zfs_arc_hits_total[5m]) + rate(zfs_arc_misses_total[5m])) * 100

# Scrub errors
zfs_pool_scrub_errors_total

# Pool state (1 = ONLINE)
zfs_pool_health == 1
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;ZFS Alerts&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# /etc/prometheus/rules/zfs.yml
groups:
  - name: zfs
    rules:
      - alert: ZFSPoolDegraded
        expr: zfs_pool_health != 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: &quot;ZFS pool {{ $labels.pool }} is degraded&quot;

      - alert: ZFSPoolSpaceLow
        expr: (zfs_pool_allocated_bytes / zfs_pool_size_bytes) * 100 &amp;gt; 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: &quot;ZFS pool {{ $labels.pool }} is {{ $value }}% full&quot;

      - alert: ZFSScrubErrors
        expr: zfs_pool_scrub_errors_total &amp;gt; 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: &quot;ZFS pool {{ $labels.pool }} has scrub errors&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Ceph Monitoring&lt;/h2&gt;
&lt;h3&gt;Built-in Ceph Metrics&lt;/h3&gt;
&lt;p&gt;Ceph exposes Prometheus metrics natively:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On Ceph manager node
ceph mgr module enable prometheus

# Metrics at
# http://ceph-mgr:9283/metrics
&lt;/code&gt;&lt;/pre&gt;
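&lt;p&gt;The port can differ, so confirm where the module is listening before pointing Prometheus at it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Lists active manager service endpoints, e.g.
# {&quot;prometheus&quot;: &quot;http://pve1:9283/&quot;}
ceph mgr services
&lt;/code&gt;&lt;/pre&gt;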
&lt;h3&gt;Prometheus Config for Ceph&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Add to prometheus.yml
- job_name: &apos;ceph&apos;
  static_configs:
    - targets:
        - &apos;pve1:9283&apos;  # Ceph manager
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Ceph Alerts&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# /etc/prometheus/rules/ceph.yml
groups:
  - name: ceph
    rules:
      - alert: CephHealthWarning
        expr: ceph_health_status == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: &quot;Ceph cluster health is WARN&quot;

      - alert: CephHealthCritical
        expr: ceph_health_status == 2
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: &quot;Ceph cluster health is ERROR (HEALTH_ERR)&quot;

      - alert: CephOSDDown
        expr: ceph_osd_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: &quot;Ceph OSD {{ $labels.osd }} is down&quot;

      - alert: CephPGsUnclean
        expr: ceph_pg_total - ceph_pg_active_clean &amp;gt; 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: &quot;Ceph has {{ $value }} unclean PGs&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;SMART Monitoring&lt;/h2&gt;
&lt;p&gt;Predict disk failures before they happen:&lt;/p&gt;
&lt;h3&gt;Install smartmontools&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# On each Proxmox node
apt install smartmontools

# Enable SMART on disks
smartctl --smart=on /dev/sda
&lt;/code&gt;&lt;/pre&gt;
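&lt;p&gt;Before wiring up an exporter, spot-check a drive directly. A sketch using smartctl&apos;s JSON output (device path is an example):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Overall health verdict (true = passed)
smartctl -j -H /dev/sda | jq .smart_status.passed

# Kick off a short self-test; results show up in the log a few minutes later
smartctl -t short /dev/sda
smartctl -l selftest /dev/sda
&lt;/code&gt;&lt;/pre&gt;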
&lt;h3&gt;Prometheus SMART Exporter&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Install
pip install prometheus-smart-exporter

# Run
smart_exporter --port 9110
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;SMART Alerts&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# /etc/prometheus/rules/smart.yml
groups:
  - name: smart
    rules:
      - alert: DiskSMARTWarning
        expr: smart_device_health != 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: &quot;Disk {{ $labels.device }} SMART health warning&quot;

      - alert: DiskReallocationCount
        expr: smart_raw_value{attribute_name=&quot;Reallocated_Sector_Ct&quot;} &amp;gt; 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: &quot;Disk {{ $labels.device }} has reallocated sectors&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Log Aggregation&lt;/h2&gt;
&lt;h3&gt;Loki for Logs&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# docker-compose.yml for Loki
version: &quot;3&quot;
services:
  loki:
    image: grafana/loki:latest
    ports:
      - &quot;3100:3100&quot;
    volumes:
      - loki-data:/loki

  promtail:
    image: grafana/promtail:latest
    volumes:
      - /var/log:/var/log:ro
      - ./promtail-config.yml:/etc/promtail/config.yml
    command: -config.file=/etc/promtail/config.yml

volumes:
  loki-data:
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Promtail on Proxmox Nodes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# /etc/promtail/config.yml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: proxmox
    static_configs:
      - targets:
          - localhost
        labels:
          job: proxmox
          host: pve1
          __path__: /var/log/*.log

  - job_name: pve-cluster
    static_configs:
      - targets:
          - localhost
        labels:
          job: pve-cluster
          host: pve1
          __path__: /var/log/pve/tasks/*
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Key Logs to Monitor&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Proxmox-specific logs
/var/log/pveproxy.log      # Web UI access
/var/log/pve/tasks/        # Task logs
/var/log/pve-firewall.log  # Firewall logs

# System logs
/var/log/syslog            # General system
/var/log/auth.log          # Authentication
/var/log/kern.log          # Kernel messages

# Ceph logs (if using)
/var/log/ceph/*.log
&lt;/code&gt;&lt;/pre&gt;
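&lt;p&gt;Once these files land in Loki, they become queryable from Grafana&apos;s Explore view. Two LogQL sketches (labels match the Promtail config above):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# All lines from pve1 mentioning errors
{job=&quot;proxmox&quot;, host=&quot;pve1&quot;} |= &quot;error&quot;

# Per-host rate of error lines over 5 minutes
sum by (host) (rate({job=&quot;proxmox&quot;} |= &quot;error&quot; [5m]))
&lt;/code&gt;&lt;/pre&gt;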
&lt;h2&gt;Alerting Rules Summary&lt;/h2&gt;
&lt;h3&gt;Critical Alerts (Page Immediately)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;groups:
  - name: critical
    rules:
      - alert: HostDown
        expr: up{job=&quot;proxmox-nodes&quot;} == 0
        for: 1m
        labels:
          severity: critical

      - alert: StorageCritical
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 &amp;lt; 10
        for: 1m
        labels:
          severity: critical

      - alert: MemoryExhausted
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 &amp;lt; 5
        for: 1m
        labels:
          severity: critical

      - alert: ZFSPoolFailed
        expr: zfs_pool_health &amp;gt;= 2  # DEGRADED or worse
        for: 1m
        labels:
          severity: critical
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Warning Alerts (Check Soon)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;groups:
  - name: warnings
    rules:
      - alert: HighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=&quot;idle&quot;}[5m])) * 100) &amp;gt; 90
        for: 15m
        labels:
          severity: warning

      - alert: StorageWarning
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 &amp;lt; 20
        for: 5m
        labels:
          severity: warning

      - alert: BackupFailed
        expr: pve_storage_backup_last_success_time &amp;lt; (time() - 86400)
        for: 1h
        labels:
          severity: warning
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Alertmanager Configuration&lt;/h2&gt;
&lt;p&gt;Route alerts appropriately:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# /etc/alertmanager/alertmanager.yml
global:
  smtp_smarthost: &apos;smtp.example.com:587&apos;
  smtp_from: &apos;alerts@example.com&apos;

route:
  group_by: [&apos;alertname&apos;, &apos;severity&apos;]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: &apos;default&apos;

  routes:
    - match:
        severity: critical
      receiver: &apos;pagerduty&apos;
      repeat_interval: 1h

    - match:
        severity: warning
      receiver: &apos;slack&apos;
      repeat_interval: 4h

receivers:
  - name: &apos;default&apos;
    email_configs:
      - to: &apos;admin@example.com&apos;

  - name: &apos;pagerduty&apos;
    pagerduty_configs:
      - service_key: &apos;xxx&apos;

  - name: &apos;slack&apos;
    slack_configs:
      - api_url: &apos;https://hooks.slack.com/services/xxx&apos;
        channel: &apos;#alerts&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Dashboard Overview&lt;/h2&gt;
&lt;p&gt;My Grafana home dashboard shows:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Row 1: Cluster Overview
├── Total nodes (up/down)
├── Total VMs (running/stopped)
├── Cluster storage usage
└── Active alerts

Row 2: Per-Node Resources
├── CPU usage per node
├── Memory usage per node
├── Network I/O per node
└── Disk I/O per node

Row 3: Storage Health
├── ZFS pool status
├── Ceph health (if using)
├── Storage capacity trends
└── SMART warnings

Row 4: Backups
├── Last backup time
├── Backup success rate
├── Backup storage usage
└── Restore test status
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;You can&apos;t manage what you can&apos;t see.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Proxmox UI shows the present. Observability shows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What happened (logs)&lt;/li&gt;
&lt;li&gt;How things are trending (metrics)&lt;/li&gt;
&lt;li&gt;What&apos;s about to break (alerts)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The investment in monitoring pays off when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Disk fills up → you knew 2 weeks ago&lt;/li&gt;
&lt;li&gt;Host overloaded → you saw the trend&lt;/li&gt;
&lt;li&gt;Ceph degraded → alerted immediately&lt;/li&gt;
&lt;li&gt;Backup failed → notified same day&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Without monitoring, you find out when users complain. With monitoring, you find out before users notice. That&apos;s the difference between reactive and proactive operations.&lt;/p&gt;
</content:encoded><category>proxmox</category><category>monitoring</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>IP Management: Getting VM IPs Reliably (DHCP, MAC Mapping, Integrations)</title><link>https://ashimov.com/posts/proxmox-ip-management/</link><guid isPermaLink="true">https://ashimov.com/posts/proxmox-ip-management/</guid><description>Reliable IP address management for Proxmox VMs. Covers DHCP strategies, MAC-to-IP mapping, router integrations, inventory collection, and why IP addresses are data that must be collected automatically.</description><pubDate>Fri, 19 Sep 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&quot;What&apos;s the IP of that VM?&quot; shouldn&apos;t require logging into Proxmox, checking DHCP leases, or guessing. IP addresses are infrastructure data. They should be queryable, predictable, and documented automatically.&lt;/p&gt;
&lt;p&gt;Manual IP tracking breaks at scale: spreadsheets go stale, DHCP hands out a different IP after a reboot or lease expiry, and static IPs mean manual per-VM configuration.&lt;/p&gt;
&lt;p&gt;IP addresses are data. They need to be collected automatically.&lt;/p&gt;
&lt;h2&gt;The IP Problem&lt;/h2&gt;
&lt;p&gt;VMs need IP addresses. Getting them reliably is harder than it looks:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;DHCP challenges:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;IP changes on reboot (unless reserved)&lt;/li&gt;
&lt;li&gt;Lease expires, new IP assigned&lt;/li&gt;
&lt;li&gt;&quot;What IP did that VM get?&quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Static IP challenges:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Manual configuration per VM&lt;/li&gt;
&lt;li&gt;Easy to have conflicts&lt;/li&gt;
&lt;li&gt;Doesn&apos;t work with templates (need customization)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Cloud-init challenges:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Works great for initial setup&lt;/li&gt;
&lt;li&gt;Changing IP requires VM recreation&lt;/li&gt;
&lt;li&gt;Need to track assigned IPs somewhere&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Strategy 1: DHCP with MAC Reservations&lt;/h2&gt;
&lt;p&gt;The most reliable approach for dynamic environments: the DHCP server reserves an IP based on the VM&apos;s MAC address.&lt;/p&gt;
&lt;h3&gt;How It Works&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;1. VM created with specific MAC address
2. MAC registered in DHCP server with reserved IP
3. VM boots, requests DHCP
4. DHCP server gives reserved IP
5. IP is consistent across reboots
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Proxmox Side&lt;/h3&gt;
&lt;p&gt;Specify MAC address when creating VM:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create VM with specific MAC
qm create 100 --name web-server --net0 virtio=BC:24:11:00:01:00,bridge=vmbr0

# Or update existing
qm set 100 --net0 virtio=BC:24:11:00:01:00,bridge=vmbr0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Use a MAC address scheme:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;BC:24:11:XX:YY:ZZ
         │  │  └─ Sequence (00-FF)
         │  └──── VM ID low byte
         └─────── VM ID high byte

Example:
VM 100: BC:24:11:00:64:00 (0x64 = 100)
VM 101: BC:24:11:00:65:00
VM 256: BC:24:11:01:00:00
&lt;/code&gt;&lt;/pre&gt;
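&lt;p&gt;The scheme is easy to generate programmatically, which keeps MAC assignment out of human hands. A minimal sketch (hypothetical helper name):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def vmid_to_mac(vmid, seq=0):
    # Encode the VM ID into octets 4 and 5 (high byte, low byte),
    # with a per-NIC sequence number in the last octet.
    high = (vmid // 256) % 256
    low = vmid % 256
    return "BC:24:11:{:02X}:{:02X}:{:02X}".format(high, low, seq)

print(vmid_to_mac(100))  # BC:24:11:00:64:00
&lt;/code&gt;&lt;/pre&gt;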
&lt;h3&gt;Router Side (MikroTik Example)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Add DHCP reservation
/ip dhcp-server lease add address=10.0.0.100 mac-address=BC:24:11:00:64:00 server=dhcp1 comment=&quot;web-server&quot;

# Or via script for bulk
:foreach mac,ip in={
  &quot;BC:24:11:00:64:00&quot;=&quot;10.0.0.100&quot;;
  &quot;BC:24:11:00:65:00&quot;=&quot;10.0.0.101&quot;;
  &quot;BC:24:11:00:66:00&quot;=&quot;10.0.0.102&quot;
} do={
  /ip dhcp-server lease add address=$ip mac-address=$mac server=dhcp1
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Router Side (OPNsense/pfSense)&lt;/h3&gt;
&lt;p&gt;Services → DHCPv4 → [Interface] → DHCP Static Mappings&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;MAC address: BC:24:11:00:64:00
IP address: 10.0.0.100
Hostname: web-server
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Router Side (VyOS)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure
set service dhcp-server shared-network-name LAN subnet 10.0.0.0/24 static-mapping web-server mac-address &apos;BC:24:11:00:64:00&apos;
set service dhcp-server shared-network-name LAN subnet 10.0.0.0/24 static-mapping web-server ip-address &apos;10.0.0.100&apos;
commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Strategy 2: Cloud-Init Static IPs&lt;/h2&gt;
&lt;p&gt;For immutable VMs where IP is set at creation.&lt;/p&gt;
&lt;h3&gt;Terraform Example&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;resource &quot;proxmox_vm_qemu&quot; &quot;server&quot; {
  name   = &quot;web-server&quot;
  clone  = &quot;ubuntu-template&quot;

  ipconfig0 = &quot;ip=10.0.0.100/24,gw=10.0.0.1&quot;

  # IP is set via cloud-init at first boot
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Manual Cloud-Init&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;qm set 100 --ipconfig0 ip=10.0.0.100/24,gw=10.0.0.1
&lt;/code&gt;&lt;/pre&gt;
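&lt;p&gt;Before booting, you can verify exactly what cloud-init will receive:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Dump the generated network config for VM 100
qm cloudinit dump 100 network
&lt;/code&gt;&lt;/pre&gt;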
&lt;h3&gt;Tracking Static IPs&lt;/h3&gt;
&lt;p&gt;Maintain IP allocation in code:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# inventory/ip-allocation.yml
networks:
  production:
    subnet: 10.0.0.0/24
    gateway: 10.0.0.1
    allocated:
      10.0.0.10: proxmox-host
      10.0.0.100: web-server-1
      10.0.0.101: web-server-2
      10.0.0.150: database

  management:
    subnet: 10.10.0.0/24
    gateway: 10.10.0.1
    allocated:
      10.10.0.10: proxmox-mgmt
      10.10.0.100: monitoring
&lt;/code&gt;&lt;/pre&gt;
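&lt;p&gt;Because the allocation file is code, duplicate assignments can be caught before deployment. A sketch (hypothetical function, operating on the parsed YAML as a plain dict):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import collections

def find_conflicts(networks):
    # Map each IP to every host that claims it, across all networks.
    owners = collections.defaultdict(list)
    for net in networks.values():
        for ip, host in net.get("allocated", {}).items():
            owners[ip].append(host)
    # Keep only IPs claimed more than once.
    return {ip: hosts for ip, hosts in owners.items() if len(hosts) != 1}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Wire it into a pre-commit hook or CI step and fail the run when it returns anything.&lt;/p&gt;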
&lt;h2&gt;Strategy 3: IPAM Integration&lt;/h2&gt;
&lt;p&gt;For larger environments, use dedicated IPAM (IP Address Management).&lt;/p&gt;
&lt;h3&gt;phpIPAM Integration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Query IPAM for next available IP
curl -X POST &quot;https://ipam.example.com/api/app/addresses/first_free/3/&quot; \
  -H &quot;token: xxx&quot; \
  -d &quot;hostname=new-server&quot;

# Register IP
curl -X POST &quot;https://ipam.example.com/api/app/addresses/&quot; \
  -H &quot;token: xxx&quot; \
  -d &quot;subnetId=3&amp;amp;ip=10.0.0.105&amp;amp;hostname=new-server&amp;amp;description=Web server&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;NetBox Integration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;import pynetbox

nb = pynetbox.api(&apos;https://netbox.example.com&apos;, token=&apos;xxx&apos;)

# Get next available IP
prefix = nb.ipam.prefixes.get(prefix=&apos;10.0.0.0/24&apos;)
next_ip = prefix.available_ips.list()[0]

# Create IP assignment
nb.ipam.ip_addresses.create(
    address=str(next_ip),
    dns_name=&apos;web-server.lab.local&apos;,
    description=&apos;Web server&apos;,
    assigned_object_type=&apos;virtualization.virtualmachine&apos;,
    assigned_object_id=vm_id
)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Collecting VM IPs from Proxmox&lt;/h2&gt;
&lt;h3&gt;Via QEMU Guest Agent&lt;/h3&gt;
&lt;p&gt;Requires qemu-guest-agent installed in VM:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Get network info from running VM
qm guest cmd 100 network-get-interfaces

# Output includes IP addresses
# Parse with jq
qm guest cmd 100 network-get-interfaces | jq -r &apos;.[] | select(.name != &quot;lo&quot;) | .[&quot;ip-addresses&quot;][] | select(.[&quot;ip-address-type&quot;] == &quot;ipv4&quot;) | .[&quot;ip-address&quot;]&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Via API&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Get VM status including network
pvesh get /nodes/pve1/qemu/100/agent/network-get-interfaces

# Or for all VMs
for vmid in $(pvesh get /nodes/pve1/qemu --output-format json | jq -r &apos;.[].vmid&apos;); do
  echo &quot;VM ${vmid}:&quot;
  pvesh get /nodes/pve1/qemu/${vmid}/agent/network-get-interfaces 2&amp;gt;/dev/null | jq -r &apos;.result[] | select(.name != &quot;lo&quot;) | &quot;\(.name): \(.[&quot;ip-addresses&quot;][0][&quot;ip-address&quot;])&quot;&apos;
done
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Inventory Script&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;#!/usr/bin/env python3
# collect-inventory.py

import json
import subprocess
from proxmoxer import ProxmoxAPI

proxmox = ProxmoxAPI(&apos;proxmox.lab.local&apos;, user=&apos;root@pam&apos;, password=&apos;xxx&apos;, verify_ssl=False)

inventory = {}

for node in proxmox.nodes.get():
    node_name = node[&apos;node&apos;]

    for vm in proxmox.nodes(node_name).qemu.get():
        vmid = vm[&apos;vmid&apos;]
        name = vm[&apos;name&apos;]
        status = vm[&apos;status&apos;]

        vm_info = {
            &apos;name&apos;: name,
            &apos;node&apos;: node_name,
            &apos;status&apos;: status,
            &apos;ip_addresses&apos;: []
        }

        if status == &apos;running&apos;:
            try:
                interfaces = proxmox.nodes(node_name).qemu(vmid).agent(&apos;network-get-interfaces&apos;).get()
                for iface in interfaces[&apos;result&apos;]:
                    if iface[&apos;name&apos;] != &apos;lo&apos;:
                        for addr in iface.get(&apos;ip-addresses&apos;, []):
                            if addr[&apos;ip-address-type&apos;] == &apos;ipv4&apos;:
                                vm_info[&apos;ip_addresses&apos;].append(addr[&apos;ip-address&apos;])
            except Exception:
                pass  # Guest agent not available

        inventory[vmid] = vm_info

print(json.dumps(inventory, indent=2))
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Dynamic Ansible Inventory&lt;/h2&gt;
&lt;p&gt;Generate Ansible inventory from Proxmox:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/usr/bin/env python3
# proxmox_inventory.py

import json
from proxmoxer import ProxmoxAPI

def get_inventory():
    proxmox = ProxmoxAPI(&apos;proxmox.lab.local&apos;,
                         user=&apos;ansible@pve!inventory&apos;,
                         token_name=&apos;inventory&apos;,
                         token_value=&apos;xxx&apos;,
                         verify_ssl=False)

    inventory = {
        &apos;_meta&apos;: {&apos;hostvars&apos;: {}},
        &apos;all&apos;: {&apos;children&apos;: [&apos;proxmox_vms&apos;]},
        &apos;proxmox_vms&apos;: {&apos;hosts&apos;: []}
    }

    for node in proxmox.nodes.get():
        for vm in proxmox.nodes(node[&apos;node&apos;]).qemu.get():
            if vm[&apos;status&apos;] != &apos;running&apos;:
                continue

            vmid = vm[&apos;vmid&apos;]
            name = vm[&apos;name&apos;]

            # Get IP from guest agent
            try:
                interfaces = proxmox.nodes(node[&apos;node&apos;]).qemu(vmid).agent(&apos;network-get-interfaces&apos;).get()
                for iface in interfaces[&apos;result&apos;]:
                    if iface[&apos;name&apos;] != &apos;lo&apos;:
                        for addr in iface.get(&apos;ip-addresses&apos;, []):
                            if addr[&apos;ip-address-type&apos;] == &apos;ipv4&apos;:
                                ip = addr[&apos;ip-address&apos;]
                                inventory[&apos;proxmox_vms&apos;][&apos;hosts&apos;].append(name)
                                inventory[&apos;_meta&apos;][&apos;hostvars&apos;][name] = {
                                    &apos;ansible_host&apos;: ip,
                                    &apos;proxmox_vmid&apos;: vmid,
                                    &apos;proxmox_node&apos;: node[&apos;node&apos;]
                                }
                                break
            except Exception:
                pass  # Guest agent not available

    return inventory

if __name__ == &apos;__main__&apos;:
    print(json.dumps(get_inventory(), indent=2))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Usage:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Use dynamic inventory
ansible -i proxmox_inventory.py all -m ping

# In ansible.cfg
[defaults]
inventory = ./proxmox_inventory.py
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;DNS Integration&lt;/h2&gt;
&lt;p&gt;Automatically register VMs in DNS:&lt;/p&gt;
&lt;h3&gt;With PowerDNS API&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# register-dns.sh

VM_NAME=$1
IP=$2
DOMAIN=&quot;lab.local&quot;
PDNS_API=&quot;http://dns.lab.local:8081/api/v1&quot;
PDNS_KEY=&quot;xxx&quot;

# Add A record
curl -X PATCH &quot;${PDNS_API}/servers/localhost/zones/${DOMAIN}.&quot; \
  -H &quot;X-API-Key: ${PDNS_KEY}&quot; \
  -H &quot;Content-Type: application/json&quot; \
  -d &quot;{
    \&quot;rrsets\&quot;: [{
      \&quot;name\&quot;: \&quot;${VM_NAME}.${DOMAIN}.\&quot;,
      \&quot;type\&quot;: \&quot;A\&quot;,
      \&quot;ttl\&quot;: 300,
      \&quot;changetype\&quot;: \&quot;REPLACE\&quot;,
      \&quot;records\&quot;: [{\&quot;content\&quot;: \&quot;${IP}\&quot;, \&quot;disabled\&quot;: false}]
    }]
  }&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;With nsupdate (BIND)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# update-dns.sh

VM_NAME=$1
IP=$2
DOMAIN=&quot;lab.local&quot;
DNS_SERVER=&quot;10.0.0.53&quot;
KEY_FILE=&quot;/etc/bind/keys/update.key&quot;

nsupdate -k ${KEY_FILE} &amp;lt;&amp;lt; EOF
server ${DNS_SERVER}
zone ${DOMAIN}
update delete ${VM_NAME}.${DOMAIN} A
update add ${VM_NAME}.${DOMAIN} 300 A ${IP}
send
EOF
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Automation Pipeline&lt;/h2&gt;
&lt;p&gt;Complete workflow:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# vm-creation.yml (Ansible)
- name: Create VM with managed IP
  hosts: localhost
  vars:
    vm_name: web-server
    vm_id: 100
    mac_address: &quot;BC:24:11:00:64:00&quot;
    ip_address: &quot;10.0.0.100&quot;

  tasks:
    - name: Create VM in Proxmox
      community.general.proxmox_kvm:
        api_host: proxmox.lab.local
        api_token_id: terraform
        api_token_secret: &quot;{{ vault_proxmox_token }}&quot;
        node: pve1
        vmid: &quot;{{ vm_id }}&quot;
        name: &quot;{{ vm_name }}&quot;
        clone: ubuntu-template
        net:
          net0: &quot;virtio={{ mac_address }},bridge=vmbr0&quot;

    - name: Register DHCP reservation on router
      community.routeros.command:
        commands:
          - /ip dhcp-server lease add address={{ ip_address }} mac-address={{ mac_address }} server=dhcp1 comment=&quot;{{ vm_name }}&quot;
      delegate_to: router

    - name: Register DNS
      community.general.nsupdate:
        server: &quot;10.0.0.53&quot;
        zone: &quot;lab.local&quot;
        record: &quot;{{ vm_name }}&quot;
        type: &quot;A&quot;
        value: &quot;{{ ip_address }}&quot;

    - name: Start VM
      community.general.proxmox_kvm:
        api_host: proxmox.lab.local
        api_token_id: terraform
        api_token_secret: &quot;{{ vault_proxmox_token }}&quot;
        node: pve1
        vmid: &quot;{{ vm_id }}&quot;
        state: started

    - name: Wait for VM to be reachable
      wait_for:
        host: &quot;{{ ip_address }}&quot;
        port: 22
        delay: 10
        timeout: 300

    - name: Update inventory
      lineinfile:
        path: inventory/hosts
        line: &quot;{{ vm_name }} ansible_host={{ ip_address }}&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;IP addresses are data. They must be collected automatically.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Manual IP management fails because:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Humans forget to update documentation&lt;/li&gt;
&lt;li&gt;Spreadsheets get stale&lt;/li&gt;
&lt;li&gt;&quot;What IP is that?&quot; becomes a daily question&lt;/li&gt;
&lt;li&gt;Conflicts happen because no one checked&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Automated IP management works because:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;DHCP reservations are code, versioned and reviewable&lt;/li&gt;
&lt;li&gt;Inventory is generated from actual state&lt;/li&gt;
&lt;li&gt;DNS updates automatically&lt;/li&gt;
&lt;li&gt;Conflicts are detected before deployment&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Choose your strategy based on scale:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Small (1-20 VMs): DHCP reservations, manual tracking&lt;/li&gt;
&lt;li&gt;Medium (20-100 VMs): Cloud-init static IPs, generated inventory&lt;/li&gt;
&lt;li&gt;Large (100+ VMs): IPAM integration (NetBox, phpIPAM)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The goal is always the same: asking &quot;what&apos;s the IP?&quot; should return an answer in seconds, from automation, not from hunting through UIs and logs.&lt;/p&gt;
</content:encoded><category>proxmox</category><category>automation</category><category>networking</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Golden Images Pipeline: Building Templates Like a Factory</title><link>https://ashimov.com/posts/proxmox-golden-images/</link><guid isPermaLink="true">https://ashimov.com/posts/proxmox-golden-images/</guid><description>Automated VM template creation for Proxmox. Covers Packer integration, cloud-init pipelines, image versioning, testing, and why images must be reproducible or they become unique snowflakes.</description><pubDate>Tue, 16 Sep 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Manual template creation works once. Install OS, configure, convert to template. Done. Until you need to update it. Then you do it again manually. And again. Eventually, no one remembers what&apos;s in the template or how it was built.&lt;/p&gt;
&lt;p&gt;A golden image pipeline treats templates like software: version controlled, automatically built, tested before use. When you need an update, you change the code and the pipeline builds a new template.&lt;/p&gt;
&lt;p&gt;Images must be reproducible. If you can&apos;t rebuild an identical image from code, you have a unique snowflake, not a template.&lt;/p&gt;
&lt;h2&gt;What Makes a Golden Image&lt;/h2&gt;
&lt;p&gt;A golden image is a VM template with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Known contents&lt;/strong&gt;: Every package, config, and file is documented&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reproducible build&lt;/strong&gt;: Same inputs = same output&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Versioned&lt;/strong&gt;: v1, v2, v3 with change history&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tested&lt;/strong&gt;: Verified before production use&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Immutable&lt;/strong&gt;: Never modified after creation&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Manual Pipeline (Simple Start)&lt;/h2&gt;
&lt;p&gt;Before Packer, understand the process:&lt;/p&gt;
&lt;h3&gt;1. Download Cloud Image&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Ubuntu cloud image
wget https://cloud-images.ubuntu.com/noble/current/noble-server-cloudimg-amd64.img

# Debian cloud image
wget https://cloud.debian.org/images/cloud/bookworm/latest/debian-12-generic-amd64.qcow2

# AlmaLinux cloud image
wget https://repo.almalinux.org/almalinux/9/cloud/x86_64/images/AlmaLinux-9-GenericCloud-latest.x86_64.qcow2
&lt;/code&gt;&lt;/pre&gt;
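&lt;p&gt;Cloud images ship with published checksums; verifying the download is cheap insurance (URL shown for the Ubuntu image above):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;wget https://cloud-images.ubuntu.com/noble/current/SHA256SUMS
sha256sum -c SHA256SUMS --ignore-missing
&lt;/code&gt;&lt;/pre&gt;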
&lt;h3&gt;2. Create VM and Import&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Create VM
qm create 9000 --name &quot;ubuntu-2404-base&quot; --memory 2048 --cores 2 \
  --net0 virtio,bridge=vmbr0 --scsihw virtio-scsi-pci

# Import cloud image
qm importdisk 9000 noble-server-cloudimg-amd64.img local-zfs

# Attach disk
qm set 9000 --scsi0 local-zfs:vm-9000-disk-0

# Add cloud-init drive
qm set 9000 --ide2 local-zfs:cloudinit

# Boot settings
qm set 9000 --boot c --bootdisk scsi0
qm set 9000 --serial0 socket --vga serial0
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Customize (Optional)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Start VM with temporary cloud-init
qm set 9000 --ciuser temp --cipassword temp123
qm start 9000

# SSH in and customize
ssh temp@&amp;lt;ip&amp;gt;
sudo apt update &amp;amp;&amp;amp; sudo apt upgrade -y
sudo apt install -y qemu-guest-agent vim htop curl git

# Clean up
sudo cloud-init clean
sudo rm -rf /var/log/*.log
sudo rm -rf /tmp/*
sudo shutdown -h now
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4. Convert to Template&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Remove temporary cloud-init settings
qm set 9000 --delete ciuser,cipassword

# Convert to template
qm template 9000
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;5. Version and Document&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Rename with version
qm set 9000 --name &quot;ubuntu-2404-v1&quot;

# Add description
qm set 9000 --description &quot;Ubuntu 24.04 LTS
Version: 1
Date: 2025-01-08
Base: noble-server-cloudimg-amd64.img
Packages: qemu-guest-agent, vim, htop, curl, git
Changes: Initial release&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Packer Pipeline (Automated)&lt;/h2&gt;
&lt;p&gt;Packer automates the entire process.&lt;/p&gt;
&lt;h3&gt;Install Packer&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# HashiCorp repository
wget -O - https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo &quot;deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main&quot; | sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt update &amp;amp;&amp;amp; sudo apt install packer

# Install Proxmox plugin
packer plugins install github.com/hashicorp/proxmox
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Directory Structure&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;packer-templates/
├── templates/
│   ├── ubuntu-2404/
│   │   ├── ubuntu-2404.pkr.hcl
│   │   ├── variables.pkr.hcl
│   │   ├── http/
│   │   │   └── user-data
│   │   └── scripts/
│   │       ├── base.sh
│   │       ├── cleanup.sh
│   │       └── cloud-init.sh
│   └── debian-12/
│       └── ...
├── common/
│   ├── scripts/
│   │   └── common-packages.sh
│   └── files/
│       └── motd
└── Makefile
&lt;/code&gt;&lt;/pre&gt;
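&lt;p&gt;The Makefile at the root drives builds so no one types packer commands by hand. A minimal sketch (targets and variable names are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;VERSION ?= 1

ubuntu-2404:
	packer init templates/ubuntu-2404
	packer build -var &quot;version=$(VERSION)&quot; templates/ubuntu-2404

debian-12:
	packer init templates/debian-12
	packer build -var &quot;version=$(VERSION)&quot; templates/debian-12
&lt;/code&gt;&lt;/pre&gt;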
&lt;h3&gt;Packer Template&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# templates/ubuntu-2404/ubuntu-2404.pkr.hcl

packer {
  required_plugins {
    proxmox = {
      version = &quot;&amp;gt;= 1.1.0&quot;
      source  = &quot;github.com/hashicorp/proxmox&quot;
    }
  }
}

source &quot;proxmox-iso&quot; &quot;ubuntu-2404&quot; {
  # Proxmox connection
  proxmox_url              = var.proxmox_url
  username                 = var.proxmox_username
  token                    = var.proxmox_token
  node                     = var.proxmox_node
  insecure_skip_tls_verify = true

  # VM settings
  vm_id   = var.vm_id
  vm_name = &quot;ubuntu-2404-v${var.version}&quot;

  # ISO (get checksum from releases.ubuntu.com)
  iso_url          = &quot;https://releases.ubuntu.com/24.04/ubuntu-24.04-live-server-amd64.iso&quot;
  iso_checksum     = &quot;sha256:YOUR_CHECKSUM_HERE&quot;  # Get from SHA256SUMS file
  iso_storage_pool = &quot;local&quot;
  unmount_iso      = true

  # Hardware
  cores    = 2
  memory   = 2048
  cpu_type = &quot;host&quot;

  # Disk
  scsi_controller = &quot;virtio-scsi-pci&quot;
  disks {
    disk_size    = &quot;32G&quot;
    storage_pool = var.storage_pool
    type         = &quot;scsi&quot;
  }

  # Network
  network_adapters {
    model  = &quot;virtio&quot;
    bridge = &quot;vmbr0&quot;
  }

  # Cloud-init
  cloud_init              = true
  cloud_init_storage_pool = var.storage_pool

  # Boot
  boot_command = [
    &quot;&amp;lt;esc&amp;gt;&amp;lt;wait&amp;gt;&quot;,
    &quot;e&amp;lt;wait&amp;gt;&quot;,
    &quot;&amp;lt;down&amp;gt;&amp;lt;down&amp;gt;&amp;lt;down&amp;gt;&amp;lt;end&amp;gt;&quot;,
    &quot; autoinstall ds=nocloud-net;s=http://{{ .HTTPIP }}:{{ .HTTPPort }}/&quot;,
    &quot;&amp;lt;f10&amp;gt;&quot;
  ]

  boot_wait = &quot;5s&quot;

  # HTTP server for autoinstall
  http_directory = &quot;http&quot;

  # SSH
  ssh_username = &quot;packer&quot;
  ssh_password = &quot;packer&quot;
  ssh_timeout  = &quot;20m&quot;

  # Template
  template_name        = &quot;ubuntu-2404-v${var.version}&quot;
  template_description = &quot;Ubuntu 24.04 LTS - Built ${timestamp()}&quot;
}

build {
  sources = [&quot;source.proxmox-iso.ubuntu-2404&quot;]

  # Base configuration
  provisioner &quot;shell&quot; {
    scripts = [
      &quot;scripts/base.sh&quot;,
      &quot;../../common/scripts/common-packages.sh&quot;
    ]
  }

  # Copy files
  provisioner &quot;file&quot; {
    source      = &quot;../../common/files/motd&quot;
    destination = &quot;/tmp/motd&quot;
  }

  provisioner &quot;shell&quot; {
    inline = [&quot;sudo mv /tmp/motd /etc/motd&quot;]
  }

  # Cleanup
  provisioner &quot;shell&quot; {
    scripts = [
      &quot;scripts/cleanup.sh&quot;,
      &quot;scripts/cloud-init.sh&quot;
    ]
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Variables&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# templates/ubuntu-2404/variables.pkr.hcl

variable &quot;proxmox_url&quot; {
  type    = string
  default = &quot;https://proxmox.lab.local:8006/api2/json&quot;
}

variable &quot;proxmox_username&quot; {
  type    = string
  default = &quot;packer@pve!automation&quot;
}

variable &quot;proxmox_token&quot; {
  type      = string
  sensitive = true
}

variable &quot;proxmox_node&quot; {
  type    = string
  default = &quot;pve1&quot;
}

variable &quot;vm_id&quot; {
  type    = number
  default = 9000
}

variable &quot;storage_pool&quot; {
  type    = string
  default = &quot;local-zfs&quot;
}

variable &quot;version&quot; {
  type    = string
  default = &quot;1&quot;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Provisioning Scripts&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# scripts/base.sh
#!/bin/bash
set -ex

# Wait for cloud-init
cloud-init status --wait

# Update system
sudo apt update
sudo apt upgrade -y

# Install packages
sudo apt install -y \
  qemu-guest-agent \
  vim \
  htop \
  curl \
  wget \
  git \
  jq \
  unzip

# Enable guest agent
sudo systemctl enable qemu-guest-agent
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# scripts/cleanup.sh
#!/bin/bash
set -ex

# Clean apt
sudo apt autoremove -y
sudo apt clean

# Clean logs
sudo journalctl --rotate
sudo journalctl --vacuum-time=1s
sudo rm -rf /var/log/*.log
sudo rm -rf /var/log/*.gz

# Clean temp
sudo rm -rf /tmp/*
sudo rm -rf /var/tmp/*

# Clean SSH keys (regenerate on first boot)
sudo rm -f /etc/ssh/ssh_host_*

# Clean machine-id
sudo truncate -s 0 /etc/machine-id
sudo rm -f /var/lib/dbus/machine-id

# Clean history
history -c
sudo rm -f /root/.bash_history
sudo rm -f /home/*/.bash_history
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# scripts/cloud-init.sh
#!/bin/bash
set -ex

# Reset cloud-init
sudo cloud-init clean

# Remove network config (cloud-init will regenerate)
sudo rm -f /etc/netplan/*.yaml

# The template is now ready
echo &quot;Template preparation complete&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Build Template&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Set token
export PKR_VAR_proxmox_token=&quot;xxxx-xxxx-xxxx&quot;

# Initialize plugins, then validate
packer init templates/ubuntu-2404/
packer validate templates/ubuntu-2404/

# Build
packer build -var &quot;version=2&quot; templates/ubuntu-2404/
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Cloud Image Pipeline (Faster)&lt;/h2&gt;
&lt;p&gt;Skip ISO install by starting with cloud images:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# templates/ubuntu-2404-cloud/ubuntu-2404-cloud.pkr.hcl

source &quot;proxmox-clone&quot; &quot;ubuntu-2404&quot; {
  # Clone from uploaded cloud image
  clone_vm = &quot;ubuntu-2404-cloud-base&quot;

  # ... rest of config
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Pre-upload cloud image:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Download and upload cloud image to Proxmox
wget https://cloud-images.ubuntu.com/noble/current/noble-server-cloudimg-amd64.img

# Create base VM from cloud image (one-time)
qm create 8000 --name &quot;ubuntu-2404-cloud-base&quot; ...
qm importdisk 8000 noble-server-cloudimg-amd64.img local-zfs
qm set 8000 --scsi0 local-zfs:vm-8000-disk-0
qm template 8000
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Packer then clones from the base, customizes it, and produces a versioned template, skipping the slow ISO install entirely.&lt;/p&gt;
&lt;h2&gt;CI/CD Integration&lt;/h2&gt;
&lt;h3&gt;GitLab CI&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# .gitlab-ci.yml
stages:
  - validate
  - build
  - test

variables:
  PACKER_VERSION: &quot;1.10.0&quot;

validate:
  stage: validate
  script:
    - packer init templates/ubuntu-2404/
    - packer validate templates/ubuntu-2404/

build:
  stage: build
  script:
    - packer build -var &quot;version=${CI_PIPELINE_IID}&quot; templates/ubuntu-2404/
  only:
    - main

test:
  stage: test
  script:
    - ./scripts/test-template.sh ubuntu-2404-v${CI_PIPELINE_IID}
  only:
    - main
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;GitHub Actions&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# .github/workflows/build-template.yml
name: Build Template

on:
  push:
    branches: [main]
    paths:
      - &apos;templates/**&apos;

jobs:
  build:
    runs-on: self-hosted  # Need access to Proxmox
    steps:
      - uses: actions/checkout@v4

      - name: Setup Packer
        uses: hashicorp/setup-packer@main
        with:
          version: &quot;1.10.0&quot;

      - name: Init Packer
        run: packer init templates/ubuntu-2404/

      - name: Build Template
        env:
          PKR_VAR_proxmox_token: ${{ secrets.PROXMOX_TOKEN }}
        run: |
          packer build -var &quot;version=${{ github.run_number }}&quot; templates/ubuntu-2404/
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Testing Templates&lt;/h2&gt;
&lt;h3&gt;Automated Test Script&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# scripts/test-template.sh

TEMPLATE=$1
TEST_VM_ID=9999

echo &quot;Testing template: ${TEMPLATE}&quot;

# Clone template
qm clone $(qm list | grep &quot;${TEMPLATE}&quot; | awk &apos;{print $1}&apos;) ${TEST_VM_ID} --name &quot;template-test&quot;

# Configure cloud-init
qm set ${TEST_VM_ID} --ciuser test --sshkeys ~/.ssh/id_ed25519.pub
qm set ${TEST_VM_ID} --ipconfig0 ip=dhcp

# Start VM
qm start ${TEST_VM_ID}

# Wait for boot
sleep 60

# Get IP (interface index varies by template; adjust the jq path if needed)
IP=$(qm guest cmd ${TEST_VM_ID} network-get-interfaces | jq -r &apos;.[1][&quot;ip-addresses&quot;][0][&quot;ip-address&quot;]&apos;)

# Run tests
echo &quot;Testing SSH...&quot;
ssh -o StrictHostKeyChecking=no test@${IP} &quot;echo &apos;SSH OK&apos;&quot;

echo &quot;Testing packages...&quot;
ssh test@${IP} &quot;which vim htop curl git&quot;

echo &quot;Testing guest agent...&quot;
qm agent ${TEST_VM_ID} ping

# Cleanup
qm stop ${TEST_VM_ID}
qm destroy ${TEST_VM_ID}

echo &quot;All tests passed!&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Test Checklist&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;[ ] VM boots successfully
[ ] Cloud-init completes
[ ] SSH works with key auth
[ ] Required packages installed
[ ] Guest agent responds
[ ] No leftover sensitive data
[ ] Machine-id regenerated
[ ] SSH host keys regenerated
&lt;/code&gt;&lt;/pre&gt;
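&lt;p&gt;The last few checklist items can be spot-checked from inside a freshly cloned VM (a sketch, assuming SSH access as the &lt;code&gt;test&lt;/code&gt; user from the script above):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Host keys should exist again (regenerated on first boot)
ssh test@${IP} &apos;ls /etc/ssh/ssh_host_*_key&apos;

# machine-id should be non-empty and unique per clone
ssh test@${IP} &apos;test -s /etc/machine-id &amp;amp;&amp;amp; cat /etc/machine-id&apos;

# No leftover shell history
ssh test@${IP} &apos;! test -s ~/.bash_history&apos;
&lt;/code&gt;&lt;/pre&gt;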
&lt;h2&gt;Versioning Strategy&lt;/h2&gt;
&lt;h3&gt;Semantic Versioning&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;ubuntu-2404-v1.0.0    # Major: Breaking changes
ubuntu-2404-v1.1.0    # Minor: New features
ubuntu-2404-v1.1.1    # Patch: Bug fixes
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Date-Based Versioning&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;ubuntu-2404-20250108  # Date of build
ubuntu-2404-202501    # Month of build
&lt;/code&gt;&lt;/pre&gt;
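&lt;p&gt;Date-based names are easy to generate at build time. With the Packer template above, this sketch yields names like &lt;code&gt;ubuntu-2404-v20250108&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Build with a date-stamped version
packer build -var &quot;version=$(date +%Y%m%d)&quot; templates/ubuntu-2404/
&lt;/code&gt;&lt;/pre&gt;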
&lt;h3&gt;Build Number&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;ubuntu-2404-b42       # CI build number
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Tracking Active Version&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Create symlink-style alias
qm set 9002 --description &quot;... LATEST: true&quot;

# Or use tags
qm set 9002 --tags &quot;latest,ubuntu,2404&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Images must be reproducible. Otherwise, they&apos;re unique snowflakes.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A template you clicked together manually is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Undocumented (what&apos;s in it?)&lt;/li&gt;
&lt;li&gt;Unreproducible (can you rebuild it exactly?)&lt;/li&gt;
&lt;li&gt;Untested (does it actually work?)&lt;/li&gt;
&lt;li&gt;Unversioned (which version is this?)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A golden image pipeline produces templates that are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Documented (code shows everything)&lt;/li&gt;
&lt;li&gt;Reproducible (same code = same image)&lt;/li&gt;
&lt;li&gt;Tested (automated tests before use)&lt;/li&gt;
&lt;li&gt;Versioned (clear history of changes)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The investment in automation pays off when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Security update needed → rebuild all templates&lt;/li&gt;
&lt;li&gt;New requirement → change code, rebuild&lt;/li&gt;
&lt;li&gt;&quot;What&apos;s in this template?&quot; → read the code&lt;/li&gt;
&lt;li&gt;Audit requirement → show build history&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Start simple: document your manual process. Then automate it. Then add CI/CD. Each step makes your templates more reliable and less mysterious.&lt;/p&gt;
</content:encoded><category>proxmox</category><category>automation</category><category>virtualization</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Infrastructure as Code: Terraform Proxmox Provider — Patterns That Won&apos;t Rot</title><link>https://ashimov.com/posts/proxmox-terraform/</link><guid isPermaLink="true">https://ashimov.com/posts/proxmox-terraform/</guid><description>Terraform with Proxmox done right. Covers provider configuration, module structure, state management, safe changes, and why IaC is about predictability, not faster clicking.</description><pubDate>Fri, 12 Sep 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Clicking through the Proxmox UI works for one VM. It doesn&apos;t work for thirty VMs that need to be consistent. It doesn&apos;t work when you need to recreate an environment. It doesn&apos;t work when &quot;what changed?&quot; matters.&lt;/p&gt;
&lt;p&gt;Terraform brings Infrastructure as Code to Proxmox: define VMs in files, track changes in Git, apply reproducibly. But Terraform with Proxmox has quirks. The provider has limitations. State can drift. Changes can be destructive.&lt;/p&gt;
&lt;p&gt;This is how to use Terraform with Proxmox in patterns that won&apos;t rot.&lt;/p&gt;
&lt;h2&gt;Why Terraform for Proxmox&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Terraform solves:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Reproducible environments (dev = staging = prod)&lt;/li&gt;
&lt;li&gt;Change tracking (what changed, when, why)&lt;/li&gt;
&lt;li&gt;Collaboration (PRs, code review for infrastructure)&lt;/li&gt;
&lt;li&gt;Documentation (code is documentation)&lt;/li&gt;
&lt;li&gt;Disaster recovery (rebuild from code)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Terraform doesn&apos;t solve:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Day-2 operations inside VMs (use Ansible)&lt;/li&gt;
&lt;li&gt;Configuration management (use Ansible, Chef, Puppet)&lt;/li&gt;
&lt;li&gt;One-off tasks (just use the UI)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Provider Setup&lt;/h2&gt;
&lt;h3&gt;Install Provider&lt;/h3&gt;
&lt;p&gt;In your Terraform project:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# versions.tf
terraform {
  required_version = &quot;&amp;gt;= 1.0&quot;

  required_providers {
    proxmox = {
      source  = &quot;Telmate/proxmox&quot;
      version = &quot;~&amp;gt; 3.0&quot;
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Provider Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# provider.tf
provider &quot;proxmox&quot; {
  pm_api_url          = &quot;https://proxmox.lab.local:8006/api2/json&quot;
  pm_api_token_id     = &quot;terraform@pve!automation&quot;
  pm_api_token_secret = var.proxmox_api_secret

  # TLS verification
  pm_tls_insecure = false  # Set true only for self-signed certs

  # Parallel operations
  pm_parallel = 4

  # Logging (for debugging)
  pm_log_enable = true
  pm_log_file   = &quot;terraform-plugin-proxmox.log&quot;
  pm_log_levels = {
    _default    = &quot;debug&quot;
    _capturelog = &quot;&quot;
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;API Token Creation&lt;/h3&gt;
&lt;p&gt;On Proxmox:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create dedicated Terraform user
pveum user add terraform@pve

# Create token with privilege separation disabled
pveum user token add terraform@pve automation --privsep 0

# Grant permissions
pveum acl modify / --users terraform@pve --roles PVEAdmin
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Store the token in an environment variable or a secrets manager. The Telmate provider reads &lt;code&gt;PM_API_TOKEN_SECRET&lt;/code&gt; directly; alternatively, set &lt;code&gt;TF_VAR_proxmox_api_secret&lt;/code&gt; to populate the Terraform variable:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;export PM_API_TOKEN_SECRET=&quot;xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# variables.tf
variable &quot;proxmox_api_secret&quot; {
  description = &quot;Proxmox API token secret&quot;
  type        = string
  sensitive   = true
  default     = &quot;&quot; # Use TF_VAR_proxmox_api_secret env var
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Basic VM Resource&lt;/h2&gt;
&lt;h3&gt;Clone from Template&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# vm.tf
resource &quot;proxmox_vm_qemu&quot; &quot;web_server&quot; {
  name        = &quot;web-server-01&quot;
  target_node = &quot;pve1&quot;

  # Clone from template
  clone = &quot;ubuntu-2404-template&quot;
  full_clone = true

  # Hardware
  cores   = 2
  sockets = 1
  memory  = 4096

  # Agent (required for IP retrieval)
  agent = 1

  # Disk
  disks {
    scsi {
      scsi0 {
        disk {
          size    = &quot;32G&quot;
          storage = &quot;local-zfs&quot;
        }
      }
    }
  }

  # Network
  network {
    model  = &quot;virtio&quot;
    bridge = &quot;vmbr0&quot;
    tag    = 10
  }

  # Cloud-init
  os_type    = &quot;cloud-init&quot;
  ciuser     = &quot;admin&quot;
  cipassword = var.vm_password
  sshkeys    = file(&quot;~/.ssh/id_ed25519.pub&quot;)

  ipconfig0 = &quot;ip=10.10.0.100/24,gw=10.10.0.1&quot;

  # Lifecycle
  lifecycle {
    ignore_changes = [
      network,  # Don&apos;t recreate on network changes
    ]
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Output VM Info&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# outputs.tf
output &quot;web_server_ip&quot; {
  value       = proxmox_vm_qemu.web_server.default_ipv4_address
  description = &quot;Web server IP address&quot;
}

output &quot;web_server_id&quot; {
  value       = proxmox_vm_qemu.web_server.vmid
  description = &quot;VM ID in Proxmox&quot;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Module Structure&lt;/h2&gt;
&lt;p&gt;For reusable, maintainable code:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;proxmox-terraform/
├── modules/
│   ├── vm/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   └── lxc/
│       ├── main.tf
│       ├── variables.tf
│       └── outputs.tf
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── terraform.tfvars
│   ├── staging/
│   │   └── ...
│   └── prod/
│       └── ...
├── .gitignore
└── README.md
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;VM Module&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# modules/vm/variables.tf
variable &quot;name&quot; {
  description = &quot;VM name&quot;
  type        = string
}

variable &quot;target_node&quot; {
  description = &quot;Proxmox node to create VM on&quot;
  type        = string
  default     = &quot;pve1&quot;
}

variable &quot;template&quot; {
  description = &quot;Template to clone from&quot;
  type        = string
  default     = &quot;ubuntu-2404-template&quot;
}

variable &quot;cores&quot; {
  description = &quot;Number of CPU cores&quot;
  type        = number
  default     = 2
}

variable &quot;memory&quot; {
  description = &quot;Memory in MB&quot;
  type        = number
  default     = 2048
}

variable &quot;disk_size&quot; {
  description = &quot;Disk size&quot;
  type        = string
  default     = &quot;32G&quot;
}

variable &quot;storage&quot; {
  description = &quot;Storage pool&quot;
  type        = string
  default     = &quot;local-zfs&quot;
}

variable &quot;network_bridge&quot; {
  description = &quot;Network bridge&quot;
  type        = string
  default     = &quot;vmbr0&quot;
}

variable &quot;vlan_tag&quot; {
  description = &quot;VLAN tag&quot;
  type        = number
  default     = null
}

variable &quot;ip_address&quot; {
  description = &quot;Static IP address with CIDR&quot;
  type        = string
}

variable &quot;gateway&quot; {
  description = &quot;Default gateway&quot;
  type        = string
}

variable &quot;ssh_keys&quot; {
  description = &quot;SSH public keys&quot;
  type        = string
}

variable &quot;tags&quot; {
  description = &quot;VM tags&quot;
  type        = list(string)
  default     = []
}
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# modules/vm/main.tf
resource &quot;proxmox_vm_qemu&quot; &quot;vm&quot; {
  name        = var.name
  target_node = var.target_node
  clone       = var.template
  full_clone  = true

  cores   = var.cores
  sockets = 1
  memory  = var.memory
  agent   = 1

  disks {
    scsi {
      scsi0 {
        disk {
          size    = var.disk_size
          storage = var.storage
        }
      }
    }
  }

  network {
    model  = &quot;virtio&quot;
    bridge = var.network_bridge
    tag    = var.vlan_tag
  }

  os_type = &quot;cloud-init&quot;
  sshkeys = var.ssh_keys

  ipconfig0 = &quot;ip=${var.ip_address},gw=${var.gateway}&quot;

  tags = join(&quot;,&quot;, var.tags)

  lifecycle {
    ignore_changes = [network]
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# modules/vm/outputs.tf
output &quot;id&quot; {
  value = proxmox_vm_qemu.vm.vmid
}

output &quot;name&quot; {
  value = proxmox_vm_qemu.vm.name
}

output &quot;ip_address&quot; {
  value = proxmox_vm_qemu.vm.default_ipv4_address
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Using Modules&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# environments/dev/main.tf
module &quot;web_servers&quot; {
  source = &quot;../../modules/vm&quot;

  count = 2

  name        = &quot;web-${count.index + 1}&quot;
  target_node = &quot;pve1&quot;
  template    = &quot;ubuntu-2404-template&quot;

  cores    = 2
  memory   = 4096
  disk_size = &quot;32G&quot;

  ip_address = &quot;10.10.0.${100 + count.index}/24&quot;
  gateway    = &quot;10.10.0.1&quot;

  ssh_keys = file(&quot;~/.ssh/id_ed25519.pub&quot;)

  tags = [&quot;web&quot;, &quot;dev&quot;]
}

module &quot;database&quot; {
  source = &quot;../../modules/vm&quot;

  name        = &quot;db-1&quot;
  target_node = &quot;pve1&quot;
  template    = &quot;ubuntu-2404-template&quot;

  cores    = 4
  memory   = 8192
  disk_size = &quot;100G&quot;

  ip_address = &quot;10.10.0.50/24&quot;
  gateway    = &quot;10.10.0.1&quot;

  ssh_keys = file(&quot;~/.ssh/id_ed25519.pub&quot;)

  tags = [&quot;database&quot;, &quot;dev&quot;]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;State Management&lt;/h2&gt;
&lt;h3&gt;Remote State&lt;/h3&gt;
&lt;p&gt;Never use local state for teams:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# backend.tf
terraform {
  backend &quot;s3&quot; {
    bucket         = &quot;terraform-state&quot;
    key            = &quot;proxmox/dev/terraform.tfstate&quot;
    region         = &quot;us-east-1&quot;
    encrypt        = true
    dynamodb_table = &quot;terraform-locks&quot;
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or with Terraform Cloud:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;terraform {
  cloud {
    organization = &quot;my-org&quot;
    workspaces {
      name = &quot;proxmox-dev&quot;
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;State Drift&lt;/h3&gt;
&lt;p&gt;Proxmox changes outside Terraform cause drift:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check for drift
terraform plan

# If drift detected, either:
# 1. Import the change into state
# 2. Revert the change in Proxmox
# 3. Update Terraform to match
&lt;/code&gt;&lt;/pre&gt;
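&lt;p&gt;To accept an out-of-band change into state without touching infrastructure (Terraform 0.15.4 and later), use a refresh-only plan and apply, then update the configuration to match:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Show what drifted
terraform plan -refresh-only

# Write the refreshed values into state
terraform apply -refresh-only
&lt;/code&gt;&lt;/pre&gt;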
&lt;h3&gt;Import Existing Resources&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Import existing VM
terraform import proxmox_vm_qemu.existing &apos;pve1/qemu/100&apos;

# Then add to your .tf file
resource &quot;proxmox_vm_qemu&quot; &quot;existing&quot; {
  name        = &quot;existing-vm&quot;
  target_node = &quot;pve1&quot;
  # ... match existing config
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Safe Changes&lt;/h2&gt;
&lt;h3&gt;Lifecycle Rules&lt;/h3&gt;
&lt;p&gt;Prevent accidental destruction:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;resource &quot;proxmox_vm_qemu&quot; &quot;production_db&quot; {
  name = &quot;prod-db&quot;
  # ...

  lifecycle {
    prevent_destroy = true

    # Don&apos;t recreate for these changes
    ignore_changes = [
      network,
      disk,
    ]
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Plan Before Apply&lt;/h3&gt;
&lt;p&gt;Always review:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Generate plan
terraform plan -out=tfplan

# Review plan file
terraform show tfplan

# Only if plan looks good
terraform apply tfplan
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Targeted Changes&lt;/h3&gt;
&lt;p&gt;Limit blast radius:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Only apply to specific resource
terraform apply -target=module.web_servers

# Only apply to specific instance
terraform apply -target=&apos;module.web_servers[0]&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Variables and Environments&lt;/h2&gt;
&lt;h3&gt;Environment-Specific Variables&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# environments/dev/terraform.tfvars
environment = &quot;dev&quot;
vm_count    = 2
vm_size     = &quot;small&quot;

# environments/prod/terraform.tfvars
environment = &quot;prod&quot;
vm_count    = 5
vm_size     = &quot;large&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Variable Validation&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;variable &quot;environment&quot; {
  description = &quot;Environment name&quot;
  type        = string

  validation {
    condition     = contains([&quot;dev&quot;, &quot;staging&quot;, &quot;prod&quot;], var.environment)
    error_message = &quot;Environment must be dev, staging, or prod.&quot;
  }
}

variable &quot;vm_size&quot; {
  description = &quot;VM size preset&quot;
  type        = string
  default     = &quot;small&quot;

  validation {
    condition     = contains([&quot;small&quot;, &quot;medium&quot;, &quot;large&quot;], var.vm_size)
    error_message = &quot;VM size must be small, medium, or large.&quot;
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Size Presets&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# locals.tf
locals {
  vm_sizes = {
    small = {
      cores  = 2
      memory = 2048
      disk   = &quot;32G&quot;
    }
    medium = {
      cores  = 4
      memory = 4096
      disk   = &quot;64G&quot;
    }
    large = {
      cores  = 8
      memory = 8192
      disk   = &quot;128G&quot;
    }
  }

  selected_size = local.vm_sizes[var.vm_size]
}

# Usage
resource &quot;proxmox_vm_qemu&quot; &quot;vm&quot; {
  cores  = local.selected_size.cores
  memory = local.selected_size.memory
  # ...
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Common Patterns&lt;/h2&gt;
&lt;h3&gt;Count vs For Each&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Count: Simple numbered resources
resource &quot;proxmox_vm_qemu&quot; &quot;worker&quot; {
  count = 3
  name  = &quot;worker-${count.index + 1}&quot;
  # ...
}

# For_each: Named resources (more stable)
variable &quot;vms&quot; {
  default = {
    web    = { ip = &quot;10.10.0.100&quot;, cores = 2 }
    api    = { ip = &quot;10.10.0.101&quot;, cores = 4 }
    worker = { ip = &quot;10.10.0.102&quot;, cores = 2 }
  }
}

resource &quot;proxmox_vm_qemu&quot; &quot;server&quot; {
  for_each = var.vms

  name   = each.key
  cores  = each.value.cores

  ipconfig0 = &quot;ip=${each.value.ip}/24,gw=10.10.0.1&quot;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;for_each&lt;/code&gt; is safer: removing a middle item from a &lt;code&gt;count&lt;/code&gt; list shifts every index after it, which Terraform treats as destroy-and-recreate. Keyed resources are unaffected.&lt;/p&gt;
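&lt;p&gt;Migrating existing resources from &lt;code&gt;count&lt;/code&gt; to &lt;code&gt;for_each&lt;/code&gt; doesn&apos;t have to destroy them: Terraform 1.1+ &lt;code&gt;moved&lt;/code&gt; blocks remap state addresses (a sketch using the resource names above):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;moved {
  from = proxmox_vm_qemu.worker[0]
  to   = proxmox_vm_qemu.server[&quot;worker&quot;]
}
&lt;/code&gt;&lt;/pre&gt;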
&lt;h3&gt;Dynamic Blocks&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;variable &quot;additional_disks&quot; {
  default = [
    { size = &quot;100G&quot;, storage = &quot;local-zfs&quot; },
    { size = &quot;200G&quot;, storage = &quot;ceph-pool&quot; }
  ]
}

resource &quot;proxmox_vm_qemu&quot; &quot;vm&quot; {
  # ...

  dynamic &quot;disk&quot; {
    for_each = var.additional_disks
    content {
      size    = disk.value.size
      storage = disk.value.storage
      type    = &quot;scsi&quot;
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note: the flat &lt;code&gt;disk&lt;/code&gt; block here is the 2.x provider schema. With Telmate 3.x, extra disks live under the nested &lt;code&gt;disks&lt;/code&gt; structure shown earlier, whose fixed slot names don&apos;t map cleanly onto &lt;code&gt;dynamic&lt;/code&gt; blocks.&lt;/p&gt;
&lt;h3&gt;Conditional Resources&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;variable &quot;create_backup_server&quot; {
  default = false
}

resource &quot;proxmox_vm_qemu&quot; &quot;backup&quot; {
  count = var.create_backup_server ? 1 : 0
  name  = &quot;backup-server&quot;
  # ...
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Debugging&lt;/h2&gt;
&lt;h3&gt;Provider Logs&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;provider &quot;proxmox&quot; {
  pm_log_enable = true
  pm_log_file   = &quot;terraform-plugin-proxmox.log&quot;
  pm_log_levels = {
    _default = &quot;debug&quot;
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Common Issues&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;1. Template not found:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Error: 500 Configuration file &apos;nodes/pve1/qemu-server/xyz.conf&apos; does not exist
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Fix: Verify template name matches exactly.&lt;/p&gt;
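&lt;p&gt;A quick way to list the exact template names the cluster knows about (a sketch, assuming &lt;code&gt;jq&lt;/code&gt; is installed on the Proxmox node):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pvesh get /cluster/resources --type vm --output-format json \
  | jq -r &apos;.[] | select(.template == 1) | .name&apos;
&lt;/code&gt;&lt;/pre&gt;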
&lt;p&gt;&lt;strong&gt;2. IP not detected:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Output: default_ipv4_address = &quot;&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Fix: Ensure &lt;code&gt;agent = 1&lt;/code&gt; and qemu-guest-agent installed in template.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;3. Disk changes cause recreation:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Fix: Add &lt;code&gt;disk&lt;/code&gt; to &lt;code&gt;ignore_changes&lt;/code&gt; in the lifecycle block.&lt;/p&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;IaC is about predictability, not faster clicking.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The goal of Terraform isn&apos;t to create VMs faster than the UI. It&apos;s to:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Know what exists&lt;/strong&gt;: Code defines reality&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Know what changed&lt;/strong&gt;: Git history shows when and why&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reproduce reliably&lt;/strong&gt;: Same code = same infrastructure&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Collaborate safely&lt;/strong&gt;: Code review before apply&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The patterns that survive:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Modules for reusability&lt;/li&gt;
&lt;li&gt;Remote state for teams&lt;/li&gt;
&lt;li&gt;Lifecycle rules for safety&lt;/li&gt;
&lt;li&gt;Variables for flexibility&lt;/li&gt;
&lt;li&gt;Plan before apply always&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Terraform with Proxmox has rough edges. The provider isn&apos;t perfect. But imperfect IaC beats clicking through a UI every time you need to remember &quot;how did I configure that?&quot;&lt;/p&gt;
</content:encoded><category>proxmox</category><category>automation</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Security &amp; Multi-Tenancy: Roles, Pools, API Tokens, and Isolation</title><link>https://ashimov.com/posts/proxmox-multitenancy/</link><guid isPermaLink="true">https://ashimov.com/posts/proxmox-multitenancy/</guid><description>Building secure multi-tenant Proxmox environments. Covers RBAC configuration, resource pools, API token management, audit logging, and why access control is a product that requires design.</description><pubDate>Tue, 09 Sep 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A single admin with root access works for a homelab. It doesn&apos;t work when multiple people or teams share the same Proxmox cluster. Who can see what? Who can modify what? What happens when someone leaves?&lt;/p&gt;
&lt;p&gt;Access control isn&apos;t a feature you enable. It&apos;s a product you design. Every permission is a decision: who needs this access, why, and what&apos;s the blast radius if it&apos;s misused?&lt;/p&gt;
&lt;p&gt;Proxmox has robust RBAC (Role-Based Access Control). The question is whether you use it intentionally or let it grow organically into chaos.&lt;/p&gt;
&lt;h2&gt;Access Control Model&lt;/h2&gt;
&lt;p&gt;Proxmox permissions combine:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Permission = User/Group + Role + Path

Example:
- User: developer@pve
- Role: PVEVMUser
- Path: /pool/dev-team

Result: developer can use VMs in dev-team pool
&lt;/code&gt;&lt;/pre&gt;
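&lt;p&gt;The same triple expressed as a command (PVE 8 syntax; user, role, and pool names from the example above):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pveum acl modify /pool/dev-team --users developer@pve --roles PVEVMUser
&lt;/code&gt;&lt;/pre&gt;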
&lt;h3&gt;Users and Groups&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Users&lt;/strong&gt;: Individual accounts. Can be in multiple groups.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create user in Proxmox realm
pveum user add developer@pve --password &amp;lt;password&amp;gt;

# Create user in PAM realm (Linux user)
pveum user add admin@pam

# List users
pveum user list
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Groups&lt;/strong&gt;: Collections of users. Simplify permission management.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create group
pveum group add developers --comment &quot;Development team&quot;

# Add user to group
pveum user modify developer@pve --groups developers

# List groups
pveum group list
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Authentication Realms&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Realm&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;pam&lt;/td&gt;
&lt;td&gt;Linux admins who need SSH&lt;/td&gt;
&lt;td&gt;System users&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pve&lt;/td&gt;
&lt;td&gt;Web UI only users&lt;/td&gt;
&lt;td&gt;Proxmox internal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ldap&lt;/td&gt;
&lt;td&gt;Enterprise integration&lt;/td&gt;
&lt;td&gt;External directory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ad&lt;/td&gt;
&lt;td&gt;Active Directory&lt;/td&gt;
&lt;td&gt;Windows integration&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For multi-tenancy, usually:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Admins: PAM (SSH + Web UI)&lt;/li&gt;
&lt;li&gt;Regular users: PVE realm (Web UI only)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Built-in Roles&lt;/h2&gt;
&lt;p&gt;Proxmox includes these roles:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Permissions&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Administrator&lt;/td&gt;
&lt;td&gt;Everything (dangerous)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PVEAdmin&lt;/td&gt;
&lt;td&gt;Almost everything (no system access)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PVEAuditor&lt;/td&gt;
&lt;td&gt;Read-only access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PVEDatastoreAdmin&lt;/td&gt;
&lt;td&gt;Manage datastores&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PVEDatastoreUser&lt;/td&gt;
&lt;td&gt;Use datastores&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PVEPoolAdmin&lt;/td&gt;
&lt;td&gt;Manage pools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PVEPoolUser&lt;/td&gt;
&lt;td&gt;Use pools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PVEVMAdmin&lt;/td&gt;
&lt;td&gt;Full VM control&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PVEVMUser&lt;/td&gt;
&lt;td&gt;Use VMs (console, start/stop)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PVETemplateUser&lt;/td&gt;
&lt;td&gt;Clone templates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PVEUserAdmin&lt;/td&gt;
&lt;td&gt;Manage users&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NoAccess&lt;/td&gt;
&lt;td&gt;Explicit deny&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Custom Roles&lt;/h3&gt;
&lt;p&gt;Create roles for specific needs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create role with specific privileges
pveum role add VMOperator --privs &quot;VM.Console VM.PowerMgmt VM.Monitor&quot;

# List roles and the privileges each one carries
pveum role list
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Common custom roles:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Developer: Can create/manage own VMs
pveum role add Developer --privs &quot;VM.Allocate VM.Clone VM.Config.CDROM VM.Config.CPU VM.Config.Cloudinit VM.Config.Disk VM.Config.Memory VM.Config.Network VM.Console VM.Migrate VM.Monitor VM.PowerMgmt VM.Snapshot VM.Snapshot.Rollback Datastore.AllocateSpace&quot;

# Observer: Can view, nothing else
pveum role add Observer --privs &quot;VM.Audit Datastore.Audit&quot;

# Backup Operator: Can backup/restore
pveum role add BackupOperator --privs &quot;VM.Backup VM.Snapshot Datastore.AllocateSpace&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Resource Pools&lt;/h2&gt;
&lt;p&gt;Pools group resources (VMs and storage) for delegation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create pool
pveum pool add dev-team --comment &quot;Development team resources&quot;

# Add existing VM to pool
pveum pool modify dev-team --vms 100

# Add storage to pool
pveum pool modify dev-team --storage local-lvm

# List pools
pveum pool list
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Pool-Based Permissions&lt;/h3&gt;
&lt;p&gt;Grant access to pool, not individual VMs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Developers can manage VMs in their pool
pveum acl modify /pool/dev-team --users developer@pve --roles Developer

# Or by group
pveum acl modify /pool/dev-team --groups developers --roles Developer
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;New VMs in the pool automatically inherit permissions.&lt;/p&gt;
&lt;h3&gt;Pool Strategy&lt;/h3&gt;
&lt;p&gt;Organize by:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Option 1: By team
/pool/dev-team
/pool/qa-team
/pool/production

Option 2: By environment
/pool/development
/pool/staging
/pool/production

Option 3: By project
/pool/project-alpha
/pool/project-beta
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Match your organization structure.&lt;/p&gt;
&lt;h2&gt;API Tokens&lt;/h2&gt;
&lt;p&gt;API tokens are better than passwords for automation:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Separate from user password&lt;/li&gt;
&lt;li&gt;Can have different permissions&lt;/li&gt;
&lt;li&gt;Easily revoked without changing user password&lt;/li&gt;
&lt;li&gt;Audit trail shows token ID&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Creating Tokens&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Create token for user
pveum user token add developer@pve automation --privsep 0

# Output shows token secret (save it!)
# Token: developer@pve!automation
# Secret: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

# With privilege separation (token can have fewer privs than user)
pveum user token add admin@pam ci-cd --privsep 1
pveum acl modify /pool/production --tokens admin@pam!ci-cd --roles BackupOperator
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Token Best Practices&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Good: Specific tokens for specific purposes
admin@pam!terraform      # Infrastructure automation
admin@pam!ansible        # Configuration management
admin@pam!monitoring     # Read-only metrics
developer@pve!ci-build   # CI pipeline builds

# Bad: Generic tokens with admin access
admin@pam!api           # Too broad, no purpose documented
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Using Tokens in Automation&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# API call with token
curl -k -H &quot;Authorization: PVEAPIToken=developer@pve!automation=xxxx-xxxx-xxxx&quot; \
  https://proxmox:8006/api2/json/version

# Terraform provider
provider &quot;proxmox&quot; {
  pm_api_url          = &quot;https://proxmox:8006/api2/json&quot;
  pm_api_token_id     = &quot;terraform@pve!automation&quot;
  pm_api_token_secret = var.proxmox_token
}

# Ansible
proxmox_kvm:
  api_host: proxmox
  api_user: ansible@pve
  api_token_id: automation
  api_token_secret: &quot;{{ vault_proxmox_token }}&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Permission Paths&lt;/h2&gt;
&lt;p&gt;Permissions apply to paths in the resource tree:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;/                           # Root (everything)
├── /access                 # User/group management
├── /nodes                  # All nodes
│   ├── /nodes/pve1         # Specific node
├── /pool                   # All pools
│   └── /pool/dev-team      # Specific pool
├── /storage                # All storage
│   └── /storage/local      # Specific storage
└── /vms                    # All VMs (by ID)
    └── /vms/100            # Specific VM
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Permission Inheritance&lt;/h3&gt;
&lt;p&gt;Permissions cascade down:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Grant access to all VMs in a pool
pveum acl modify /pool/dev-team --users developer@pve --roles PVEVMUser

# Developer can now access:
# - /pool/dev-team (pool itself)
# - All VMs in that pool
# - Storage assigned to that pool
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Explicit Deny&lt;/h3&gt;
&lt;p&gt;NoAccess role blocks inheritance:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# User has pool access
pveum acl modify /pool/dev-team --users developer@pve --roles Developer

# But NOT this specific VM
pveum acl modify /vms/105 --users developer@pve --roles NoAccess
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Multi-Tenant Architecture&lt;/h2&gt;
&lt;h3&gt;Example: Web Hosting Provider&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Tenants: customer-a, customer-b, customer-c

Structure:
├── /pool/customer-a
│   ├── VMs 100-199
│   └── Storage quota
├── /pool/customer-b
│   ├── VMs 200-299
│   └── Storage quota
└── /pool/customer-c
    ├── VMs 300-399
    └── Storage quota

Users:
- customer-a-admin@pve → /pool/customer-a (PVEVMAdmin)
- customer-a-user@pve  → /pool/customer-a (PVEVMUser)
- customer-b-admin@pve → /pool/customer-b (PVEVMAdmin)
...

Isolation:
- Network: Separate VLANs per customer
- Storage: Pool quotas, separate datastores
- Compute: Resource limits on pools
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Example: Corporate IT&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Departments: dev, qa, production, infrastructure

Structure:
├── /pool/development
│   └── All non-prod VMs
├── /pool/qa
│   └── Test environments
├── /pool/production
│   └── Production workloads (restricted)
└── /pool/infrastructure
    └── DNS, monitoring, etc.

Groups and roles:
- developers     → /pool/development (Developer)
- qa-engineers   → /pool/qa (Developer)
- sre-team       → /pool/production (PVEVMUser)
- sre-leads      → /pool/production (PVEVMAdmin)
- infra-admins   → / (PVEAdmin)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Audit Logging&lt;/h2&gt;
&lt;p&gt;Track who did what:&lt;/p&gt;
&lt;h3&gt;Task History&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Recent tasks (via API)
pvesh get /cluster/tasks

# Node-specific tasks
pvesh get /nodes/pve1/tasks
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Web UI: Datacenter → Tasks → Filter by user&lt;/p&gt;
&lt;h3&gt;System Logs&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Auth logs
journalctl -u pveproxy | grep auth

# API access logs
tail -f /var/log/pveproxy/access.log
&lt;/code&gt;&lt;/pre&gt;
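&lt;p&gt;A quick way to answer &quot;who is hitting the API&quot; is to aggregate the access log by user. The helper below is a hypothetical sketch: it assumes pveproxy&apos;s Apache-style log lines where the third whitespace field is the user, so verify that against a real log line before relying on the field position.&lt;/p&gt;

```shell
# Count API requests per user in a pveproxy-style access log read from stdin.
# The field layout (IP, dash, user, timestamp, request, status) is an
# assumption -- check a real /var/log/pveproxy/access.log line first.
api_users_summary() {
  awk '{ count[$3]++ } END { for (u in count) print u, count[u] }' | sort
}

# Example with hypothetical log lines:
printf '%s\n' \
  '10.0.0.5 - root@pam [17/Mar/2026:10:00:01 +0000] "GET /api2/json/version HTTP/1.1" 200 120' \
  '10.0.0.5 - root@pam [17/Mar/2026:10:00:03 +0000] "POST /api2/json/access/users HTTP/1.1" 200 88' \
  '10.0.0.6 - developer@pve [17/Mar/2026:10:00:02 +0000] "GET /api2/json/nodes HTTP/1.1" 200 512' |
api_users_summary
```

&lt;p&gt;On a real node, feed it the live log: &lt;code&gt;api_users_summary &amp;lt; /var/log/pveproxy/access.log&lt;/code&gt;.&lt;/p&gt;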
&lt;h3&gt;External Audit&lt;/h3&gt;
&lt;p&gt;For compliance, forward logs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Syslog forwarding
echo &quot;*.* @syslog-server:514&quot; &amp;gt;&amp;gt; /etc/rsyslog.d/remote.conf
systemctl restart rsyslog
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Quotas and Limits&lt;/h2&gt;
&lt;p&gt;Prevent resource exhaustion:&lt;/p&gt;
&lt;h3&gt;Pool Quotas&lt;/h3&gt;
&lt;p&gt;Not built-in, but enforceable via custom roles and monitoring:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create role without VM.Allocate
pveum role add PoolUser --privs &quot;VM.Console VM.PowerMgmt VM.Monitor&quot;

# Users can use VMs but not create new ones
# Admins create VMs, assign to pool
&lt;/code&gt;&lt;/pre&gt;
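&lt;p&gt;The missing quota enforcement can be approximated with monitoring. The sketch below is hypothetical: it sums allocated disk per pool from simple &quot;vmid pool size-in-GB&quot; lines, which on a real cluster you would generate from &lt;code&gt;pvesh get /cluster/resources&lt;/code&gt;.&lt;/p&gt;

```shell
# Sketch of a pool-quota check: sum allocated disk per pool and flag
# pools over a limit. The function name and input format are made up
# for illustration.
check_pool_quota() {
  limit_gb=$1
  awk -v limit="$limit_gb" '
    { used[$2] += $3 }
    END {
      for (p in used) {
        if (used[p] > limit) { print p, "OVER", used[p] "G" } else { print p, "ok", used[p] "G" }
      }
    }' | sort
}

printf '%s\n' '100 dev-team 50' '101 dev-team 80' '200 qa-team 30' |
check_pool_quota 100
```

&lt;p&gt;Wire the OVER lines into whatever alerting you already run.&lt;/p&gt;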
&lt;h3&gt;VM Resource Limits&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Limit CPU
qm set 100 --cpulimit 2  # Max 2 cores worth

# Limit memory
qm set 100 --memory 4096 --balloon 2048

# Limit disk I/O (throttle options are set on the drive entry itself)
qm set 100 --scsi0 local-lvm:vm-100-disk-0,mbps_rd=100,mbps_wr=100
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Storage Quotas&lt;/h3&gt;
&lt;p&gt;Ceph/ZFS can enforce quotas:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# ZFS quota
zfs set quota=100G rpool/data/customer-a

# Ceph quota
ceph osd pool set-quota customer-pool max_bytes 107374182400
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Security Checklist&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;User Management:
[ ] No shared accounts
[ ] Each person has individual user
[ ] Users in appropriate groups
[ ] Unused users disabled/deleted

Roles:
[ ] Custom roles for common use cases
[ ] No one uses Administrator role directly
[ ] Principle of least privilege applied

Pools:
[ ] Resources organized into pools
[ ] Permissions at pool level (not individual VMs)
[ ] Clear ownership per pool

API Tokens:
[ ] Automation uses tokens, not passwords
[ ] Tokens have specific purposes
[ ] Tokens documented
[ ] Unused tokens revoked

Audit:
[ ] Logs retained appropriately
[ ] Regular review of access
[ ] Alerts on sensitive operations
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Token Rotation&lt;/h2&gt;
&lt;p&gt;Regular token rotation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create new token
pveum user token add admin@pam ansible-v2 --privsep 0

# Update automation to use new token

# Verify new token works

# Remove old token
pveum user token remove admin@pam ansible-v1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Schedule this quarterly or when personnel changes.&lt;/p&gt;
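&lt;p&gt;The steps above can be wrapped in a small helper. This is a dry-run sketch (the function name is made up): it only prints the &lt;code&gt;pveum&lt;/code&gt; commands so you can review them before running anything on a real node.&lt;/p&gt;

```shell
# Dry-run token rotation: print the commands for the rotation steps above.
# Remove the echoes only once you trust the sequence for your setup.
rotate_token() {
  user=$1; old=$2; new=$3
  echo "pveum user token add $user $new --privsep 0"
  echo "# ...update automation secrets, verify the new token works, then:"
  echo "pveum user token remove $user $old"
}

rotate_token admin@pam ansible-v1 ansible-v2
```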
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Access control is a product. It needs to be designed.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The lazy approach:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Everyone is admin&lt;/li&gt;
&lt;li&gt;One shared account&lt;/li&gt;
&lt;li&gt;Permissions &quot;we&apos;ll figure out later&quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The result:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;No audit trail&lt;/li&gt;
&lt;li&gt;Blast radius is entire cluster&lt;/li&gt;
&lt;li&gt;Personnel change = security nightmare&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The designed approach:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Users in groups&lt;/li&gt;
&lt;li&gt;Groups have roles&lt;/li&gt;
&lt;li&gt;Roles are minimal&lt;/li&gt;
&lt;li&gt;Resources in pools&lt;/li&gt;
&lt;li&gt;Permissions at pool level&lt;/li&gt;
&lt;li&gt;Automation uses tokens&lt;/li&gt;
&lt;li&gt;Regular access reviews&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Access control isn&apos;t overhead — it&apos;s what makes multi-tenancy possible. Design it upfront, enforce it consistently, and review it regularly.&lt;/p&gt;
</content:encoded><category>proxmox</category><category>automation</category><category>security</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Ceph on Proxmox: Honest Guide (When It&apos;s Worth It, When It&apos;s Pain)</title><link>https://ashimov.com/posts/proxmox-ceph/</link><guid isPermaLink="true">https://ashimov.com/posts/proxmox-ceph/</guid><description>Real talk about Ceph on Proxmox. Covers minimum requirements, network design, OSD configuration, recovery behavior, performance expectations, and why Ceph is great when you accept its costs.</description><pubDate>Fri, 05 Sep 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Ceph is incredible technology. Distributed, self-healing storage that scales horizontally. No single point of failure. Built into Proxmox with a nice UI. Sounds perfect.&lt;/p&gt;
&lt;p&gt;It&apos;s not perfect. Ceph has costs: hardware (more nodes, more disks, more network), complexity (distributed systems are hard), and operational overhead (recovery can saturate your network). These costs are worth it for the right use case. For the wrong use case, Ceph is pain with no benefit.&lt;/p&gt;
&lt;p&gt;This is an honest guide: when Ceph makes sense, what it really requires, and what to expect.&lt;/p&gt;
&lt;h2&gt;When Ceph Makes Sense&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Good fit:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;3+ nodes with dedicated storage networks&lt;/li&gt;
&lt;li&gt;Need for truly shared storage (HA, live migration)&lt;/li&gt;
&lt;li&gt;Can accept Ceph&apos;s resource overhead&lt;/li&gt;
&lt;li&gt;Want to scale storage by adding nodes&lt;/li&gt;
&lt;li&gt;No external SAN/NAS available&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Bad fit:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Single node (Ceph needs 3+ for reliability)&lt;/li&gt;
&lt;li&gt;Tight hardware budget (Ceph needs resources)&lt;/li&gt;
&lt;li&gt;Simple backup/restore is sufficient&lt;/li&gt;
&lt;li&gt;Already have enterprise SAN&lt;/li&gt;
&lt;li&gt;Can&apos;t dedicate network bandwidth&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Minimum Requirements (Real Minimums)&lt;/h2&gt;
&lt;h3&gt;Nodes&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Minimum: 3 nodes&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Ceph uses replication (default 3 copies). With 2 nodes, one failure = data at risk.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;3 nodes: Can lose 1 node
4 nodes: Can lose 1 node (more capacity)
5 nodes: Can lose 2 nodes (one at a time, with recovery in between)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;CPU and RAM&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Per OSD (disk):&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;1 CPU core minimum&lt;/li&gt;
&lt;li&gt;2GB RAM minimum (4GB recommended)&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;Example: 3 nodes × 4 OSDs each = 12 OSDs
Minimum: 12 cores, 24GB RAM just for Ceph
Recommended: 24 cores, 48GB RAM

Plus RAM for VMs, Proxmox, monitors...
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Network&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Minimum: 10GbE dedicated&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;1GbE works for testing but not production. Recovery after disk failure saturates the network:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;1TB disk fails, recovery speed:
- 1GbE: ~3 hours (if nothing else uses network)
- 10GbE: ~15 minutes

During recovery, performance is degraded
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Recommended: Separate networks&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;┌─────────────────────────────────────┐
│          Public Network             │
│  (client access, VM traffic)        │
│           10.0.0.0/24               │
└─────────────┬───────────────────────┘
              │
        ┌─────┼─────┐
        │     │     │
     pve1   pve2   pve3
        │     │     │
        └─────┼─────┘
              │
┌─────────────┴───────────────────────┐
│         Cluster Network             │
│  (OSD replication, recovery)        │
│          10.10.0.0/24               │
└─────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Cluster network handles heavy replication traffic. Public network serves VMs.&lt;/p&gt;
&lt;h3&gt;Storage&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;SSDs strongly recommended&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;HDDs work but:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Recovery is slow (hours to days)&lt;/li&gt;
&lt;li&gt;Random I/O performance is poor&lt;/li&gt;
&lt;li&gt;Write latency affects all VMs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Flash for metadata&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;If using HDDs, use SSDs for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;WAL (Write-Ahead Log)&lt;/li&gt;
&lt;li&gt;DB (RocksDB metadata)&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;HDD OSD with SSD metadata:
Performance: 10x better than HDD-only
Complexity: Higher, more failure modes
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Installing Ceph on Proxmox&lt;/h2&gt;
&lt;h3&gt;Initialize Ceph&lt;/h3&gt;
&lt;p&gt;From any node (installs Ceph packages):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pveceph install
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or via Web UI: Node → Ceph → Install&lt;/p&gt;
&lt;h3&gt;Create Ceph Cluster&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Initialize on first node
pveceph init --network 10.10.0.0/24

# This sets cluster network
# Default: uses same as public (not recommended)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Create Monitors&lt;/h3&gt;
&lt;p&gt;Each node needs a monitor for quorum:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On each node
pveceph mon create

# Verify
ceph mon stat
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Need at least 3 monitors for quorum (one per node in 3-node cluster).&lt;/p&gt;
&lt;h3&gt;Create Manager Daemons&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# On each node (2+ recommended)
pveceph mgr create

# Verify
ceph mgr stat
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Create OSDs&lt;/h3&gt;
&lt;p&gt;Each disk becomes an OSD:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Show disks already claimed by OSDs
ceph-volume lvm list

# Create OSD on /dev/sdb
pveceph osd create /dev/sdb

# With separate WAL/DB device (SSD for HDD OSDs)
pveceph osd create /dev/sdb --wal_dev /dev/nvme0n1 --db_dev /dev/nvme0n1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Via Web UI: Node → Ceph → OSD → Create OSD&lt;/p&gt;
&lt;h3&gt;Create Pool&lt;/h3&gt;
&lt;p&gt;Pools contain data with specific replication rules:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create pool with size 3 (3 replicas), min_size 2
pveceph pool create vmpool --size 3 --min_size 2 --pg_num 128

# Add as Proxmox storage
pvesm add rbd ceph-pool --pool vmpool --content images,rootdir
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Ceph Configuration&lt;/h2&gt;
&lt;h3&gt;Understanding PGs (Placement Groups)&lt;/h3&gt;
&lt;p&gt;PGs distribute data across OSDs. Too few = uneven distribution. Too many = overhead.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Rule of thumb:
Total PGs = (OSDs × 100) / replica count

12 OSDs, 3 replicas:
(12 × 100) / 3 = 400 PGs per pool

Divide among pools based on expected size
&lt;/code&gt;&lt;/pre&gt;
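&lt;p&gt;The rule of thumb translates directly into a calculation; rounding up to a power of two is the usual recommendation for &lt;code&gt;pg_num&lt;/code&gt;. A small sketch (the helper name is made up):&lt;/p&gt;

```shell
# Rule-of-thumb PG count: OSDs * 100 / replicas, rounded up to the next
# power of two (the commonly recommended shape for pg_num).
pg_count() {
  osds=$1; replicas=$2
  target=$(( osds * 100 / replicas ))
  pg=1
  while [ "$pg" -lt "$target" ]; do pg=$(( pg * 2 )); done
  echo "$pg"
}

pg_count 12 3   # 400 rounds up to 512
```

&lt;p&gt;Split that total across your pools in proportion to their expected size.&lt;/p&gt;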
&lt;h3&gt;Pool Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check pool settings
ceph osd pool get vmpool all

# Adjust replication
ceph osd pool set vmpool size 3
ceph osd pool set vmpool min_size 2

# Enable compression (optional)
ceph osd pool set vmpool compression_mode aggressive
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;CRUSH Rules&lt;/h3&gt;
&lt;p&gt;CRUSH determines data placement. Default spreads across hosts:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# View CRUSH map
ceph osd crush tree

# Data placement: 1 replica per host
# Protects against single host failure
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For single-node testing (NOT production):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Allow replicas on same host (DANGEROUS)
ceph osd crush rule create-replicated single_host default osd
ceph osd pool set vmpool crush_rule single_host
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Monitoring Ceph Health&lt;/h2&gt;
&lt;h3&gt;Basic Status&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Overall health
ceph status

# Should show:
# health: HEALTH_OK

# Detailed health
ceph health detail
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;OSD Status&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# OSD tree
ceph osd tree

# OSD stats
ceph osd stat

# Individual OSD
ceph osd perf
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Pool Usage&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Pool stats
ceph df

# Detailed pool info
rados df
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Dashboard&lt;/h3&gt;
&lt;p&gt;Enable Ceph dashboard:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Proxmox shows Ceph health natively:
# https://&amp;lt;node&amp;gt;:8006 → Node → Ceph → Status

# The upstream Ceph mgr dashboard is optional (needs ceph-mgr-dashboard):
ceph mgr module enable dashboard
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;What to Expect: Performance&lt;/h2&gt;
&lt;h3&gt;Write Performance&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Single SSD OSD: ~50-100 MB/s per OSD
Single NVMe OSD: ~200-500 MB/s per OSD
Aggregate: Scales with OSDs

Latency: 1-5ms (SSD), 5-20ms (HDD)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Read Performance&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Read from primary OSD, scales with OSDs
Cache helps repeated reads
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Real-World VM Performance&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Random 4K IOPS (single VM):
- Ceph SSD: 5,000-20,000 IOPS
- Local SSD: 50,000-100,000 IOPS

Latency matters more than throughput for VMs
Ceph adds network round-trip to every I/O
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Ceph won&apos;t match local NVMe performance. It provides redundancy and shared access, not speed.&lt;/p&gt;
&lt;h2&gt;Recovery Behavior&lt;/h2&gt;
&lt;h3&gt;When an OSD Fails&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;1. Ceph detects OSD down (10-30 seconds)
2. Marks OSD out (mon_osd_down_out_interval, default 10 minutes)
3. Begins recovery (re-replicating data)
4. Recovery uses cluster network bandwidth
5. Performance degraded until complete
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Recovery Impact&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;1TB OSD failure:
- Data to re-replicate: 1TB
- 10GbE network: ~15 minutes
- 1GbE network: ~3 hours
- During recovery: Degraded performance
&lt;/code&gt;&lt;/pre&gt;
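&lt;p&gt;The numbers above come from simple arithmetic: data volume in gigabits divided by link speed. A back-of-envelope sketch (helper name made up) that ignores protocol overhead and competing traffic, so real recovery runs longer:&lt;/p&gt;

```shell
# Idealized recovery time in minutes: data_gb * 8 bits / link_gbps / 60.
# Assumes the link is fully dedicated to recovery traffic.
recovery_minutes() {
  data_gb=$1; link_gbps=$2
  echo $(( data_gb * 8 / link_gbps / 60 ))
}

recovery_minutes 1000 10   # 1TB over 10GbE: ~13 minutes
recovery_minutes 1000 1    # 1TB over 1GbE: ~133 minutes
```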
&lt;h3&gt;Tuning Recovery&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Limit recovery bandwidth (default is aggressive)
ceph config set osd osd_recovery_max_active 1
ceph config set osd osd_recovery_sleep 0.1

# Check recovery status
ceph status
# Should show recovery progress
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Balance: Fast recovery vs. production performance impact.&lt;/p&gt;
&lt;h2&gt;Common Problems&lt;/h2&gt;
&lt;h3&gt;HEALTH_WARN: Too Few PGs&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Increase PGs
ceph osd pool set vmpool pg_num 256
ceph osd pool set vmpool pgp_num 256
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;HEALTH_WARN: OSDs Near Full&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check usage
ceph osd df

# Options:
# 1. Add more OSDs
# 2. Delete data
# 3. Rebalance (if uneven)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Ceph stops writes at 95% full. Plan capacity.&lt;/p&gt;
&lt;h3&gt;Slow Requests&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check for slow ops
ceph daemon osd.0 ops

# Common causes:
# - HDD latency
# - Network congestion
# - Undersized cluster
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Clock Skew&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Monitors are sensitive to time
# Check NTP
timedatectl status

# Fix: Ensure NTP is working on all nodes
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Ceph vs Alternatives&lt;/h2&gt;
&lt;h3&gt;Ceph vs Local ZFS&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Ceph&lt;/th&gt;
&lt;th&gt;Local ZFS&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Redundancy&lt;/td&gt;
&lt;td&gt;Across nodes&lt;/td&gt;
&lt;td&gt;Within node (mirror/RAIDZ)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Shared storage&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No (without replication)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Performance&lt;/td&gt;
&lt;td&gt;Network-bound&lt;/td&gt;
&lt;td&gt;Local disk speed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complexity&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Node failure&lt;/td&gt;
&lt;td&gt;VMs continue&lt;/td&gt;
&lt;td&gt;VMs stop&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Choose local ZFS&lt;/strong&gt; if you don&apos;t need shared storage.&lt;/p&gt;
&lt;h3&gt;Ceph vs NFS&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Ceph&lt;/th&gt;
&lt;th&gt;NFS&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Redundancy&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;td&gt;Requires HA NFS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Performance&lt;/td&gt;
&lt;td&gt;Parallel access&lt;/td&gt;
&lt;td&gt;Single server bottleneck&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complexity&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scaling&lt;/td&gt;
&lt;td&gt;Add nodes&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Choose NFS&lt;/strong&gt; for simpler setups with existing NAS.&lt;/p&gt;
&lt;h3&gt;Ceph vs iSCSI SAN&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Ceph&lt;/th&gt;
&lt;th&gt;SAN&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;Hardware only&lt;/td&gt;
&lt;td&gt;Hardware + licensing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scaling&lt;/td&gt;
&lt;td&gt;Add nodes&lt;/td&gt;
&lt;td&gt;Add shelves/licenses&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complexity&lt;/td&gt;
&lt;td&gt;Self-managed&lt;/td&gt;
&lt;td&gt;Vendor support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Performance&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Often better&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Choose SAN&lt;/strong&gt; if budget allows and you want vendor support.&lt;/p&gt;
&lt;h2&gt;Sizing Example&lt;/h2&gt;
&lt;h3&gt;Small Production Cluster&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;3 nodes:
- 32GB RAM each (16GB Ceph, 16GB VMs)
- 4-core CPU each
- 4× 1TB SSD per node (12 OSDs total)
- 10GbE cluster network
- 10GbE public network

Usable storage: ~4TB (12TB raw ÷ 3 replicas)
VM capacity: ~20-40 VMs depending on size
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Medium Production Cluster&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;5 nodes:
- 128GB RAM each
- 16-core CPU each
- 8× 2TB NVMe per node (40 OSDs total)
- 25GbE cluster network
- 10GbE public network

Usable storage: ~26TB (80TB raw ÷ 3 replicas)
VM capacity: ~100-200 VMs
&lt;/code&gt;&lt;/pre&gt;
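&lt;p&gt;The usable-capacity math generalizes to any replicated layout. A sketch (helper name made up); remember to keep utilization well below the full ratio so recovery has headroom:&lt;/p&gt;

```shell
# Usable capacity of a replicated Ceph pool: raw capacity / replica count.
usable_tb() {
  nodes=$1; osds_per_node=$2; osd_tb=$3; replicas=$4
  echo $(( nodes * osds_per_node * osd_tb / replicas ))
}

usable_tb 3 4 1 3    # small cluster above: 4 TB
usable_tb 5 8 2 3    # medium cluster above: 26 TB
```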
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Ceph is great when you accept its costs: hardware, network, and operational complexity.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Ceph provides:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;True shared storage&lt;/li&gt;
&lt;li&gt;Self-healing&lt;/li&gt;
&lt;li&gt;Horizontal scaling&lt;/li&gt;
&lt;li&gt;No single point of failure&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Ceph costs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;3+ nodes minimum&lt;/li&gt;
&lt;li&gt;Significant RAM (2-4GB per OSD)&lt;/li&gt;
&lt;li&gt;10GbE+ network (dedicated)&lt;/li&gt;
&lt;li&gt;Operational knowledge&lt;/li&gt;
&lt;li&gt;Recovery impacts performance&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For a 3-node homelab with 10GbE networking, Ceph is a solid choice. For a single node, Ceph is pointless complexity. For a budget cluster with 1GbE, Ceph will frustrate you.&lt;/p&gt;
&lt;p&gt;Match the tool to the problem. Ceph solves &quot;I need shared, redundant storage across multiple nodes.&quot; If that&apos;s not your problem, Ceph isn&apos;t your solution.&lt;/p&gt;
</content:encoded><category>proxmox</category><category>ha</category><category>storage</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>High Availability: HA Groups, Fencing Mindset, and Failure Testing</title><link>https://ashimov.com/posts/proxmox-ha/</link><guid isPermaLink="true">https://ashimov.com/posts/proxmox-ha/</guid><description>Proxmox HA done right. Covers HA manager configuration, fencing requirements, groups and priorities, maintenance procedures, failure testing, and why HA without tests is just a checkbox.</description><pubDate>Tue, 02 Sep 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;High availability sounds like a feature you enable. Click &quot;HA,&quot; and VMs automatically restart when a node fails. Magic.&lt;/p&gt;
&lt;p&gt;It&apos;s not magic. It&apos;s fencing, quorum, shared storage, and very specific failure handling. Get any of these wrong and HA either doesn&apos;t work, or worse — causes split-brain where VMs run on multiple nodes simultaneously, corrupting data.&lt;/p&gt;
&lt;p&gt;HA without testing is just a checkbox. A checkbox that might destroy your data when you actually need it.&lt;/p&gt;
&lt;h2&gt;HA Prerequisites&lt;/h2&gt;
&lt;p&gt;Before enabling HA, you need:&lt;/p&gt;
&lt;h3&gt;1. Cluster (3+ Nodes Recommended)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check cluster status
pvecm status

# Need quorum for HA decisions
# 2 nodes = no node can fail without losing quorum
# 3 nodes = 1 node can fail
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Two-node clusters need a QDevice for HA to work reliably.&lt;/p&gt;
&lt;h3&gt;2. Shared Storage&lt;/h3&gt;
&lt;p&gt;HA VMs must be on storage accessible from all nodes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check shared storage
pvesm status

# Valid for HA:
# - Ceph (RBD)
# - NFS
# - iSCSI
# - GlusterFS

# NOT valid:
# - local
# - local-lvm
# - local-zfs (ZFS replication allows failover, but loses writes since the last sync)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If storage isn&apos;t shared, VM can&apos;t start on another node.&lt;/p&gt;
&lt;h3&gt;3. Fencing Capability&lt;/h3&gt;
&lt;p&gt;Fencing ensures a failed node is truly dead before starting VMs elsewhere. Without fencing, you risk:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Node 1: Appears dead (network issue)
Node 2: Starts VM copy
Node 1: Actually alive, VM still running
Result: Two VMs, same disk, corruption
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Fencing (The Critical Part)&lt;/h2&gt;
&lt;h3&gt;What Fencing Does&lt;/h3&gt;
&lt;p&gt;Fencing forces a failed node to stop before HA restarts VMs:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Node detected as failed&lt;/li&gt;
&lt;li&gt;HA manager tries to fence (kill) the node&lt;/li&gt;
&lt;li&gt;Only after successful fence → start VMs on other node&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Fencing Methods&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Hardware fencing (recommended):&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;IPMI/iLO/DRAC power off&lt;/li&gt;
&lt;li&gt;PDU power cut&lt;/li&gt;
&lt;li&gt;SBD (Storage-Based Death)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Software fencing:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Watchdog timer (self-fence)&lt;/li&gt;
&lt;li&gt;SSH fence (tell node to shutdown)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Configuring Watchdog Fencing&lt;/h3&gt;
&lt;p&gt;Most common in homelab. Node kills itself if it loses quorum:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Enable the software watchdog (softdog) if no hardware watchdog exists
echo &quot;softdog&quot; &amp;gt;&amp;gt; /etc/modules

# Load module
modprobe softdog

# Verify
ls /dev/watchdog
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Proxmox HA uses watchdog automatically. If node loses quorum and can&apos;t reach cluster, watchdog triggers reboot.&lt;/p&gt;
&lt;h3&gt;IPMI Fencing (Production)&lt;/h3&gt;
&lt;p&gt;For reliable fencing, use IPMI:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Install fence agents
apt install fence-agents

# Test IPMI fencing manually
ipmitool -H 10.0.0.200 -U admin -P password power status
ipmitool -H 10.0.0.200 -U admin -P password power off
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note: Proxmox HA doesn&apos;t expose external fence agents in the GUI; its built-in mechanism is the watchdog, and fence devices (historically via &lt;code&gt;/etc/pve/ha/fence.cfg&lt;/code&gt;) are experimental. Treat IPMI as an out-of-band kill switch that you script, document, and test yourself.&lt;/p&gt;
&lt;h3&gt;Storage-Based Fencing (SBD)&lt;/h3&gt;
&lt;p&gt;Nodes write heartbeats to shared storage. Missing heartbeat = fence:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Initialize an SBD device on shared storage with explicit timeouts
# (-1 = watchdog timeout, -4 = msgwait timeout)
sbd -d /dev/sdb -1 60 -4 120 create
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note: SBD comes from the Pacemaker stack; Proxmox HA doesn&apos;t integrate it out of the box.&lt;/p&gt;
&lt;h2&gt;Enabling HA for VMs&lt;/h2&gt;
&lt;h3&gt;Add VM to HA&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Enable HA for VM 100
ha-manager add vm:100

# With specific group
ha-manager add vm:100 --group production

# Check HA status
ha-manager status
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Via Web UI: Datacenter → HA → Add → Select VM&lt;/p&gt;
&lt;h3&gt;HA States&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;State&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;started&lt;/td&gt;
&lt;td&gt;HA will ensure VM is running&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;stopped&lt;/td&gt;
&lt;td&gt;HA will ensure VM is stopped&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;disabled&lt;/td&gt;
&lt;td&gt;HA ignores this VM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ignored&lt;/td&gt;
&lt;td&gt;Resource left completely unmanaged by HA&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;HA Groups&lt;/h3&gt;
&lt;p&gt;Groups define which nodes can run HA VMs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create group preferring pve1 and pve2
ha-manager groupadd production --nodes pve1,pve2

# Add VM to group
ha-manager set vm:100 --group production

# Node priority (higher number = preferred)
ha-manager groupadd production --nodes pve1:3,pve2:2,pve3:1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With these priorities, VMs prefer pve1, fail over to pve2, and fall back to pve3 as a last resort.&lt;/p&gt;
&lt;h3&gt;Restricted Groups&lt;/h3&gt;
&lt;p&gt;Only allow VMs on specific nodes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create restricted group
ha-manager groupadd gpu-nodes --nodes pve2,pve3 --restricted 1

# VMs in this group can ONLY run on pve2 or pve3
ha-manager set vm:200 --group gpu-nodes
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Useful for VMs needing specific hardware (GPU, special storage).&lt;/p&gt;
&lt;h2&gt;HA Manager Behavior&lt;/h2&gt;
&lt;h3&gt;Node Failure Sequence&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;1. Node stops responding to cluster heartbeats
2. Other nodes detect failure (after timeout)
3. Quorum check: Do remaining nodes have majority?
4. If quorate:
   a. Attempt to fence failed node
   b. Wait for fence confirmation
   c. Start VMs on surviving nodes
5. If not quorate:
   a. Cluster freezes
   b. No HA actions (prevents split-brain)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Failover Timing&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Detection timeout:    30 seconds (default)
Fence attempt:        Variable (IPMI: seconds, watchdog: 60s)
VM startup:           10-60 seconds

Total failover time:  1-3 minutes typical
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For faster failover, tune detection but beware false positives.&lt;/p&gt;
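&lt;p&gt;This budget is simple arithmetic. A quick sketch using the default-ish numbers above (illustrative, not measured):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Rough failover budget
detect=30        # cluster detects the dead node
fence=60         # watchdog worst case
start=45         # VM boot on a surviving node
total=$((detect + fence + start))
echo Expected failover: about $total seconds
&lt;/code&gt;&lt;/pre&gt;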
&lt;h3&gt;Resource Migration&lt;/h3&gt;
&lt;p&gt;When a node comes back online, its VMs don&apos;t automatically migrate back:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# VMs stay on failover node until:
# 1. Manual migration
# 2. Next failure
# 3. Maintenance mode + recovery

# To migrate back manually
qm migrate 100 pve1 --online
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is intentional. Automatic &quot;failback&quot; risks unnecessary disruption.&lt;/p&gt;
&lt;h2&gt;Maintenance Mode&lt;/h2&gt;
&lt;p&gt;Before working on a node, use maintenance mode:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Request maintenance (HA migrates VMs away; needs PVE 7.3+)
ha-manager crm-command node-maintenance enable pve1

# Check status
ha-manager status

# Wait for migrations to complete
# Do maintenance work

# Disable maintenance
ha-manager crm-command node-maintenance disable pve1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This gracefully moves VMs, unlike a failure which is disruptive.&lt;/p&gt;
&lt;h3&gt;Manual VM Migration&lt;/h3&gt;
&lt;p&gt;For HA VMs, use:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Request HA to migrate
ha-manager migrate vm:100 pve2

# Or set VM to ignored temporarily
ha-manager set vm:100 --state ignored
qm migrate 100 pve2 --online
ha-manager set vm:100 --state started
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Don&apos;t just &lt;code&gt;qm migrate&lt;/code&gt; an HA VM — HA manager might fight you.&lt;/p&gt;
&lt;h2&gt;Testing HA (Critical)&lt;/h2&gt;
&lt;h3&gt;Test 1: Simulated Node Failure&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# On the node to &quot;fail&quot;
# Warning: if this node runs HA resources, it will self-fence
# (watchdog reboot) about a minute after losing quorum
systemctl stop pve-cluster corosync

# Watch from another node
ha-manager status

# VMs should migrate to other nodes
# After 1-2 minutes, check VMs are running elsewhere

# Restore node
systemctl start corosync pve-cluster
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Test 2: Hard Power Off&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Warning&lt;/strong&gt;: These commands immediately crash the node without graceful shutdown.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Physical power button or:
echo b &amp;gt; /proc/sysrq-trigger  # Immediate reboot (no sync)

# Or IPMI (preferred for remote testing):
ipmitool chassis power off

# This tests actual fencing behavior
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Test 3: Network Partition&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# On node, drop cluster traffic
iptables -A INPUT -p udp --dport 5405:5412 -j DROP
iptables -A OUTPUT -p udp --dport 5405:5412 -j DROP

# Node should fence itself (watchdog) or be fenced (IPMI)
# VMs should migrate

# Restore
iptables -F
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Test 4: Storage Failure&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# If using NFS, unmount it
umount -l /mnt/nfs-storage

# HA behavior depends on configuration
# VMs using that storage should fail
# Other VMs should continue

# Document what happens!
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Document Test Results&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;HA Test Report - 2025-01-08

Test: Node power off (pve2)
Method: IPMI power off
Expected: VMs 100, 101 migrate to pve1 or pve3

Timeline:
- 00:00 Power off pve2
- 00:32 Cluster detects failure
- 00:45 Fence confirmed
- 01:15 VM 100 started on pve1
- 01:28 VM 101 started on pve3

Total failover: 1 minute 28 seconds
Result: PASS

Issues: None
Tested by: Admin
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Common HA Problems&lt;/h2&gt;
&lt;h3&gt;&quot;No quorum&quot; — Nothing Happens&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check quorum
pvecm status | grep Quorate

# If &quot;Quorate: No&quot;, cluster can&apos;t make decisions
# Need majority of nodes online
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Fix: Add more nodes, add QDevice, or manually set expected votes (dangerous).&lt;/p&gt;
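&lt;p&gt;The majority rule is just integer arithmetic. A sketch of how many votes each cluster size needs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Majority needed for quorum: floor(votes / 2) + 1
for nodes in 2 3 4 5; do
  echo $nodes nodes need $((nodes / 2 + 1)) votes
done
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note the 2-node case: it needs 2 votes, so losing either node loses quorum. That is why 2-node clusters need a QDevice.&lt;/p&gt;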
&lt;h3&gt;VMs Won&apos;t Start After Failover&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check HA manager logs
journalctl -u pve-ha-lrm -f

# Common causes:
# - Shared storage not available
# - Resource constraints (RAM, CPU)
# - Start dependencies
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Split-Brain Detected&lt;/h3&gt;
&lt;p&gt;If the same VM somehow ran on two nodes at once:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# IMMEDIATELY stop VMs on one node
qm stop 100 --skiplock

# Check for disk corruption
# Restore from backup if needed
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is catastrophic. Prevent with proper fencing.&lt;/p&gt;
&lt;h3&gt;HA Service Stuck&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Restart HA services
systemctl restart pve-ha-crm
systemctl restart pve-ha-lrm

# Check status
ha-manager status
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;HA Architecture&lt;/h2&gt;
&lt;h3&gt;Minimum Viable HA&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;3 nodes minimum (for quorum)
Shared storage (NFS, Ceph, iSCSI)
Fencing (watchdog at minimum)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Production HA&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;3+ nodes
Redundant network (bonding)
Dedicated cluster network
Ceph or enterprise SAN
Hardware fencing (IPMI)
UPS with monitoring
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;HA Network Topology&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;          ┌───────────────────────────────────┐
         │        Cluster Network            │
         │    (Corosync, fencing, HA)        │
         └───────────┬───────────┬───────────┘
                     │           │
         ┌───────────┴───┐   ┌───┴───────────┐
         │     pve1      │   │     pve2      │
         │   (node 1)    │   │   (node 2)    │
         └───────┬───────┘   └───────┬───────┘
                 │                   │
         ┌───────┴───────────────────┴───────┐
         │           Storage Network          │
         │        (Ceph, iSCSI, NFS)          │
         └───────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Separate networks for cluster and storage traffic prevent storage load from delaying Corosync heartbeats and triggering false HA decisions.&lt;/p&gt;
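&lt;p&gt;A dedicated cluster link is declared per node in &lt;code&gt;/etc/pve/corosync.conf&lt;/code&gt;. A minimal sketch (the addresses are examples; remember to bump &lt;code&gt;config_version&lt;/code&gt; in the totem section whenever you edit this file):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1   # dedicated cluster network
    ring1_addr: 10.0.0.11    # fallback link over the general LAN
  }
}
&lt;/code&gt;&lt;/pre&gt;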
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;HA without tests is just a checkbox.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Enabling HA takes 30 seconds. Testing it takes hours. But that testing is what determines whether HA works when you need it.&lt;/p&gt;
&lt;p&gt;The checkbox says &quot;HA enabled.&quot; The test proves:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Fencing actually works&lt;/li&gt;
&lt;li&gt;VMs actually migrate&lt;/li&gt;
&lt;li&gt;Storage is actually shared&lt;/li&gt;
&lt;li&gt;Recovery time meets requirements&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Every HA setup has edge cases. The node that takes 5 minutes to fence. The VM that won&apos;t start because of resource constraints. The storage path that fails under load.&lt;/p&gt;
&lt;p&gt;You find these in testing, or you find them in production. Testing is cheaper.&lt;/p&gt;
&lt;p&gt;Schedule regular HA tests. Document what happens. Fix what&apos;s broken. That&apos;s how you turn a checkbox into actual high availability.&lt;/p&gt;
</content:encoded><category>proxmox</category><category>ha</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Snapshots vs Backups vs Replication: What Saved Me and What Didn&apos;t</title><link>https://ashimov.com/posts/proxmox-recovery/</link><guid isPermaLink="true">https://ashimov.com/posts/proxmox-recovery/</guid><description>Understanding data protection layers in Proxmox. Covers snapshots, backups, and replication with real failure scenarios, RPO/RTO planning, and why replication is not a replacement for backups.</description><pubDate>Fri, 29 Aug 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I&apos;ve lost data three times in production. Each time taught me something different about what &quot;protected&quot; actually means.&lt;/p&gt;
&lt;p&gt;First time: snapshot on same disk that failed. Snapshot died with the disk.
Second time: backup existed, but retention policy had pruned the version I needed.
Third time: replication was running, but it replicated the corruption.&lt;/p&gt;
&lt;p&gt;Snapshots, backups, and replication are different tools solving different problems. Using the wrong one for your failure scenario means learning the hard way.&lt;/p&gt;
&lt;h2&gt;The Three Protection Layers&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Snapshot&lt;/th&gt;
&lt;th&gt;Backup&lt;/th&gt;
&lt;th&gt;Replication&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Location&lt;/td&gt;
&lt;td&gt;Same storage&lt;/td&gt;
&lt;td&gt;Different storage&lt;/td&gt;
&lt;td&gt;Different node&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speed&lt;/td&gt;
&lt;td&gt;Instant&lt;/td&gt;
&lt;td&gt;Minutes-hours&lt;/td&gt;
&lt;td&gt;Continuous&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Protection&lt;/td&gt;
&lt;td&gt;Human error&lt;/td&gt;
&lt;td&gt;Hardware failure&lt;/td&gt;
&lt;td&gt;Site failure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Point-in-time&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Near-real-time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Survives disk failure&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Depends&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Survives site failure&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;If off-site&lt;/td&gt;
&lt;td&gt;If different site&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Each layer protects against different failures. You need all three.&lt;/p&gt;
&lt;h2&gt;Snapshots&lt;/h2&gt;
&lt;p&gt;A snapshot captures VM state at a point in time — disk and optionally RAM.&lt;/p&gt;
&lt;h3&gt;How Snapshots Work&lt;/h3&gt;
&lt;p&gt;ZFS/LVM snapshots are copy-on-write:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Before snapshot:
  Disk blocks: [A][B][C][D][E]

After snapshot:
  Current:     [A][B][C][D][E]
  Snapshot:    → points to same blocks

After modification (block C changed):
  Current:     [A][B][C&apos;][D][E]
  Snapshot:    [A][B][C][D][E]  (old C preserved)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Snapshots are instant because nothing is copied initially. Only changed blocks are preserved.&lt;/p&gt;
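&lt;p&gt;This is why a fresh snapshot consumes no space and grows with churn. A toy model of the accounting (the block size is illustrative; ZFS recordsize and volblocksize vary):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Snapshot space grows only with changed blocks (toy model, not a ZFS query)
block_size_kb=128
blocks_changed=8000
snap_space_mb=$((block_size_kb * blocks_changed / 1024))
echo Snapshot now pins about $snap_space_mb MB of old blocks
&lt;/code&gt;&lt;/pre&gt;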
&lt;h3&gt;Creating Snapshots&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# VM snapshot (disk + state)
qm snapshot 100 before-upgrade --description &quot;Before kernel upgrade&quot;

# List snapshots
qm listsnapshot 100

# Rollback
qm rollback 100 before-upgrade

# Delete snapshot
qm delsnapshot 100 before-upgrade
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;What Snapshots Are Good For&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Before risky changes&lt;/strong&gt;: Upgrade, config change, experimental work&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Quick rollback&lt;/strong&gt;: &quot;Oops, that broke it&quot; → 30-second recovery&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Testing&lt;/strong&gt;: Try something, snapshot, try variations, rollback&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;What Snapshots Don&apos;t Protect Against&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Failure scenario: Disk dies
Snapshots: Also dead (same disk)
Result: Total loss

Failure scenario: Storage controller fails
Snapshots: Also dead (same storage)
Result: Total loss

Failure scenario: Ransomware encrypts VM
Snapshots: Might survive if attacker doesn&apos;t find them
Result: Maybe recoverable, maybe not

Failure scenario: Accidental snapshot deletion
Snapshots: Gone
Result: No protection
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Rule: Snapshots are convenience, not protection.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;Backups&lt;/h2&gt;
&lt;p&gt;Backups copy data to separate storage.&lt;/p&gt;
&lt;h3&gt;Backup to PBS&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Full backup to PBS
vzdump 100 --storage pbs-store --mode snapshot --compress zstd

# Incremental (only changed since last)
# PBS does this automatically
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;What Backups Protect Against&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Failure scenario: Primary storage dies
Backups on PBS: Safe
Result: Restore from backup

Failure scenario: Host fails completely
Backups on PBS: Safe (different hardware)
Result: Restore to new host

Failure scenario: Accidental VM deletion
Backups: Safe (separate system)
Result: Restore deleted VM

Failure scenario: Ransomware encrypts VM
Backups: Safe if not mounted/accessible to VM
Result: Restore clean version
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Backup Limitations&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Failure scenario: Backup storage also fails
Result: Both copies lost

Failure scenario: Retention pruned the backup you need
Result: Can&apos;t restore that point in time

Failure scenario: Site-wide disaster (fire, flood)
On-site backups: Also destroyed
Result: Total loss without off-site copy
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;RPO: Recovery Point Objective&lt;/h3&gt;
&lt;p&gt;How much data can you lose?&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Daily backups at 1 AM:
- Failure at 11 PM = 22 hours of data loss
- RPO = 24 hours

Hourly backups:
- Maximum 1 hour of data loss
- RPO = 1 hour

Real-time replication:
- Seconds of data loss
- RPO ≈ 0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Match backup frequency to acceptable data loss.&lt;/p&gt;
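&lt;p&gt;The 22-hour figure above is just clock arithmetic:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Data loss if backups run at 01:00 and a failure hits at 23:00 the same day
backup_hour=1
failure_hour=23
loss_hours=$((failure_hour - backup_hour))
echo Data lost: $loss_hours hours
&lt;/code&gt;&lt;/pre&gt;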
&lt;h3&gt;RTO: Recovery Time Objective&lt;/h3&gt;
&lt;p&gt;How fast must you recover?&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Full VM restore from PBS:
- 100GB VM ≈ 10-30 minutes
- RTO ≈ 30 minutes

Restore from off-site:
- Download time + restore time
- RTO = hours

Rebuild from scratch + restore data:
- RTO = hours to days
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Match recovery method to acceptable downtime.&lt;/p&gt;
&lt;h2&gt;Replication&lt;/h2&gt;
&lt;p&gt;Replication continuously copies data to another location.&lt;/p&gt;
&lt;h3&gt;Proxmox Replication&lt;/h3&gt;
&lt;p&gt;Built-in ZFS replication between cluster nodes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create replication job
pvesr create-local-job 100-0 pve2 --schedule &apos;*/15&apos;  # Every 15 min

# Check replication status
pvesr status

# List jobs
pvesr list
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;How Replication Works&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Node 1 (primary)           Node 2 (replica)
┌──────────────┐           ┌──────────────┐
│   VM 100     │           │  VM 100      │
│   (active)   │──────────►│  (standby)   │
│              │  ZFS send │              │
└──────────────┘           └──────────────┘

Every 15 minutes: incremental sync
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;What Replication Protects Against&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Failure scenario: Node 1 hardware failure
Replica on Node 2: Ready to start
Result: Activate replica, minimal downtime

Failure scenario: Storage failure on Node 1
Replica on Node 2: Has recent copy
Result: Start replica (with potential 15-min data loss)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;What Replication Does NOT Protect Against&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Failure scenario: VM data corruption (application bug)
Replication: Replicates the corruption to Node 2
Result: Both copies corrupted

Failure scenario: Ransomware encrypts VM
Replication: Replicates encrypted data
Result: Both copies encrypted

Failure scenario: Accidental VM deletion
Replication: Deletion replicates
Result: Both copies deleted

Failure scenario: Cluster-wide issue
Replication: Both nodes affected
Result: No protection
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Rule: Replication protects against hardware failure, not data corruption.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;The Three-Layer Strategy&lt;/h2&gt;
&lt;p&gt;For critical VMs, use all three:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Layer 1: Snapshots
- Before changes
- Quick rollback
- Same-disk convenience

Layer 2: Backups (PBS)
- Daily/hourly
- Different storage
- Historical retention

Layer 3: Replication
- Near-real-time
- Different node
- Fast failover
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Example Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# VM 100: Critical web application

# Layer 1: Manual snapshots before changes
qm snapshot 100 pre-upgrade

# Layer 2: Hourly backups to PBS, 30-day retention
# Backup job: hourly to pbs-store
# Retention: keep-hourly=24,keep-daily=30

# Layer 3: 15-minute replication to second node
pvesr create-local-job 100-0 pve2 --schedule &apos;*/15&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Recovery scenarios:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Recovery Method&lt;/th&gt;
&lt;th&gt;Data Loss&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Bad config change&lt;/td&gt;
&lt;td&gt;Rollback snapshot&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Host hardware failure&lt;/td&gt;
&lt;td&gt;Start replica&lt;/td&gt;
&lt;td&gt;Up to 15 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage failure&lt;/td&gt;
&lt;td&gt;Restore from PBS&lt;/td&gt;
&lt;td&gt;Up to 1 hour&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data corruption&lt;/td&gt;
&lt;td&gt;Restore from PBS (earlier point)&lt;/td&gt;
&lt;td&gt;Variable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Site disaster&lt;/td&gt;
&lt;td&gt;Restore from off-site PBS&lt;/td&gt;
&lt;td&gt;Up to 24 hours&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;Real Failure Scenarios&lt;/h2&gt;
&lt;h3&gt;Scenario 1: Disk Failure&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Situation: ZFS pool loses a disk in mirror
Snapshots: Still available (pool degraded but working)
Replication: Working
Backups: Working

Action: Replace disk, resilver, no VM downtime
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Scenario 2: Complete Storage Loss&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Situation: Storage controller failure, pool unimportable
Snapshots: Lost
Replication: Available on other node

Action: Start replica, 15 minutes data loss
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Scenario 3: Database Corruption&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Situation: App bug corrupts database on Tuesday
Discovered: Thursday
Replication: Has corrupted data
Recent backups: Have corrupted data
Older backup from Monday: Clean

Action: Restore Monday backup, replay transaction logs if possible
Lesson: Longer backup retention matters
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Scenario 4: Ransomware&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Situation: Ransomware encrypts VM on Friday night
Replication: Encrypted copy on second node
Snapshots: Might be encrypted (if attacker accessed)
PBS backups: Clean (PBS not mounted inside VM)

Action: Restore from PBS backup before infection
Lesson: Air-gapped backups survive ransomware
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Calculating Your Strategy&lt;/h2&gt;
&lt;h3&gt;For Each VM, Answer:&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;RPO&lt;/strong&gt;: How much data loss is acceptable?&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Minutes → Replication + frequent backups&lt;/li&gt;
&lt;li&gt;Hours → Hourly backups&lt;/li&gt;
&lt;li&gt;Days → Daily backups&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;RTO&lt;/strong&gt;: How fast must it recover?&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Minutes → Replication + HA&lt;/li&gt;
&lt;li&gt;Hours → Local PBS restore&lt;/li&gt;
&lt;li&gt;Days → Off-site restore okay&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Retention&lt;/strong&gt;: How far back might you need?&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Days → Short retention&lt;/li&gt;
&lt;li&gt;Months → Longer retention&lt;/li&gt;
&lt;li&gt;Compliance → Years (archive separately)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Example: Different VM Classes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Class A: Critical (database, ERP)
- RPO: 15 minutes
- RTO: 30 minutes
- Retention: 90 days
Strategy: Replication (15 min) + Hourly PBS + Monthly off-site

Class B: Important (web servers, apps)
- RPO: 1 hour
- RTO: 4 hours
- Retention: 30 days
Strategy: Hourly PBS backup

Class C: Development (test VMs)
- RPO: 24 hours
- RTO: Next business day
- Retention: 7 days
Strategy: Daily PBS backup

Class D: Ephemeral (CI runners)
- RPO: N/A (rebuild from config)
- RTO: Minutes (just recreate)
- Retention: None
Strategy: No backup (infrastructure as code)
&lt;/code&gt;&lt;/pre&gt;
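&lt;p&gt;One possible translation of these classes into PBS keep options (the exact values are a judgment call, not the only valid mapping):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Hypothetical class-to-retention mapping
class_a=keep-hourly=24,keep-daily=90
class_b=keep-hourly=24,keep-daily=30
class_c=keep-daily=7
echo Class A prune options: $class_a
echo Class B prune options: $class_b
echo Class C prune options: $class_c
&lt;/code&gt;&lt;/pre&gt;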
&lt;h2&gt;Testing Your Strategy&lt;/h2&gt;
&lt;h3&gt;Monthly Tests&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# 1. Snapshot rollback test
qm snapshot 100 test-snap
# Make a change
qm rollback 100 test-snap
# Verify rollback worked

# 2. Backup restore test
qmrestore pbs-store:backup/vm/100/... 900
qm start 900
# Verify VM works
qm destroy 900

# 3. Replication failover test
# Stop source VM
qm stop 100
# Start replica on other node
# Verify it works
# Fail back to primary
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Document Results&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Test Date: 2025-01-08
Tested by: Admin

Snapshot rollback: PASS (30 seconds)
PBS restore (100GB VM): PASS (12 minutes)
Replication failover: PASS (2 minutes)

Issues found: None
Next test: 2025-02-08
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Replication is not a replacement for PBS. It&apos;s a different layer.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Each protection layer handles different failures:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Snapshots&lt;/strong&gt;: Undo mistakes (same disk)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Backups&lt;/strong&gt;: Recover from hardware failure (different storage)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Replication&lt;/strong&gt;: Fast failover (different node)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Off-site&lt;/strong&gt;: Survive site disasters (different location)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The failure you&apos;ll have is the one you didn&apos;t plan for. If you only have replication, you&apos;ll face data corruption. If you only have daily backups, you&apos;ll have the failure at 11 PM. If you only have on-site backups, you&apos;ll have the site disaster.&lt;/p&gt;
&lt;p&gt;Layer your protection. Test your recovery. Know exactly what each layer protects against and what it doesn&apos;t.&lt;/p&gt;
</content:encoded><category>proxmox</category><category>backup</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Backups Done Right: Proxmox Backup Server, Schedules, Retention, and Restore Drills</title><link>https://ashimov.com/posts/proxmox-backups/</link><guid isPermaLink="true">https://ashimov.com/posts/proxmox-backups/</guid><description>Complete guide to Proxmox Backup Server. Covers installation, incremental backups, deduplication, retention policies, verification, and why a backup only exists after a successful restore test.</description><pubDate>Tue, 26 Aug 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Everyone has backups until they need to restore. Then they discover: the backup never completed, the retention deleted what they needed, or worse — they&apos;ve never actually tested a restore.&lt;/p&gt;
&lt;p&gt;Proxmox Backup Server (PBS) is excellent backup software. Deduplication, incremental forever, encryption, verification. But software doesn&apos;t matter if your process is broken.&lt;/p&gt;
&lt;p&gt;A backup exists only after a successful restore test. Everything else is hope.&lt;/p&gt;
&lt;h2&gt;Why Proxmox Backup Server&lt;/h2&gt;
&lt;p&gt;PBS is purpose-built for Proxmox VE:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Incremental forever&lt;/strong&gt;: Only changed blocks transfer after first backup&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deduplication&lt;/strong&gt;: Identical data stored once, even across VMs&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Encryption&lt;/strong&gt;: Client-side encryption, PBS never sees plaintext&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Verification&lt;/strong&gt;: Built-in integrity checking&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pruning&lt;/strong&gt;: Automatic retention policies&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Compared to vzdump-to-NFS:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;10x less storage for typical workloads (dedup)&lt;/li&gt;
&lt;li&gt;5x faster backups (incremental)&lt;/li&gt;
&lt;li&gt;Actual verification (not just &quot;file exists&quot;)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Installing PBS&lt;/h2&gt;
&lt;h3&gt;Dedicated Machine (Recommended)&lt;/h3&gt;
&lt;p&gt;PBS should run on separate hardware. If your Proxmox host dies, your backups shouldn&apos;t die with it.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Download PBS ISO from proxmox.com
# Install on dedicated hardware or VM (on different host)

# After install, configure network
nano /etc/network/interfaces

# Update repositories (same as PVE)
mv /etc/apt/sources.list.d/pbs-enterprise.list /etc/apt/sources.list.d/pbs-enterprise.list.disabled
echo &quot;deb http://download.proxmox.com/debian/pbs bookworm pbs-no-subscription&quot; &amp;gt; /etc/apt/sources.list.d/pbs-no-subscription.list
apt update &amp;amp;&amp;amp; apt full-upgrade -y
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Access web UI at &lt;code&gt;https://&amp;lt;pbs-ip&amp;gt;:8007&lt;/code&gt;&lt;/p&gt;
&lt;h3&gt;PBS as VM on Proxmox&lt;/h3&gt;
&lt;p&gt;Acceptable for homelab, not ideal:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create VM for PBS
qm create 999 --name pbs --memory 4096 --cores 2 \
  --net0 virtio,bridge=vmbr0 \
  --scsi0 local-zfs:32 \
  --cdrom local:iso/proxmox-backup-server.iso
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Critical&lt;/strong&gt;: Store PBS VM on different storage than what you&apos;re backing up. If your main storage fails, PBS VM should survive.&lt;/p&gt;
&lt;h2&gt;Storage Configuration&lt;/h2&gt;
&lt;h3&gt;Datastore Setup&lt;/h3&gt;
&lt;p&gt;PBS organizes backups into datastores:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create datastore directory
mkdir -p /backup/datastore1

# Via web UI: Administration → Storage/Disks → Directory → Create: Datastore

# Or via CLI
proxmox-backup-manager datastore create store1 /backup/datastore1
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Storage Sizing&lt;/h3&gt;
&lt;p&gt;Deduplication means storage math is different:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Without dedup: 10 VMs × 100GB × 30 backups = 30TB
With dedup:    10 VMs × 100GB × 30 backups ≈ 500GB - 2TB
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Actual ratio depends on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How similar VMs are (templates = high dedup)&lt;/li&gt;
&lt;li&gt;How much data changes between backups&lt;/li&gt;
&lt;li&gt;How many retention points&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Start with 2x your VM total size, monitor, adjust.&lt;/p&gt;
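&lt;p&gt;A back-of-the-envelope sizing sketch (the dedup ratio is an assumption; measure your own after a few weeks of real backups):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Estimate datastore need at an assumed dedup ratio
vm_total_gb=1000        # sum of all VM disk usage
retention_points=30
dedup_ratio=15          # assumed; real ratios vary widely
raw_gb=$((vm_total_gb * retention_points))
est_gb=$((raw_gb / dedup_ratio))
echo Raw: ${raw_gb} GB, estimated with dedup: ${est_gb} GB
&lt;/code&gt;&lt;/pre&gt;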
&lt;h2&gt;Connecting Proxmox VE to PBS&lt;/h2&gt;
&lt;h3&gt;Add PBS to Proxmox&lt;/h3&gt;
&lt;p&gt;On Proxmox VE:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Add PBS storage
pvesm add pbs pbs-store \
  --server 10.0.0.50 \
  --datastore store1 \
  --username backup@pbs \
  --password \
  --fingerprint &amp;lt;fingerprint&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Get fingerprint from PBS:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On PBS
proxmox-backup-manager cert info | grep Fingerprint
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or via Web UI: Datacenter → Storage → Add → Proxmox Backup Server&lt;/p&gt;
&lt;h3&gt;Create Backup User&lt;/h3&gt;
&lt;p&gt;On PBS, create dedicated backup user:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create user
proxmox-backup-manager user create backup@pbs

# Create API token (more secure than password)
proxmox-backup-manager user generate-token backup@pbs automation

# Grant the token backup rights on the datastore
proxmox-backup-manager acl update /datastore/store1 DatastoreBackup --auth-id &apos;backup@pbs!automation&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Use API token in Proxmox connection.&lt;/p&gt;
&lt;h2&gt;Backup Jobs&lt;/h2&gt;
&lt;h3&gt;Manual Backup&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Backup single VM
vzdump 100 --storage pbs-store --mode snapshot

# Backup multiple VMs
vzdump 100 101 102 --storage pbs-store --mode snapshot
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Scheduled Backups&lt;/h3&gt;
&lt;p&gt;Via Web UI: Datacenter → Backup → Add&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Storage: pbs-store
Schedule: 01:00 (daily at 1 AM)
Selection mode: Include all (or specific VMs)
Mode: Snapshot
Compression: ZSTD
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or via CLI:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create backup job (schedule is a systemd-style calendar event, not cron)
pvesh create /cluster/backup --storage pbs-store --schedule &quot;01:00&quot; --all 1 --mode snapshot --compress zstd
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Backup Modes&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Downtime&lt;/th&gt;
&lt;th&gt;Consistency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Snapshot&lt;/td&gt;
&lt;td&gt;Atomic snapshot, VM keeps running&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Suspend&lt;/td&gt;
&lt;td&gt;Pause VM, backup, resume&lt;/td&gt;
&lt;td&gt;Seconds&lt;/td&gt;
&lt;td&gt;Better&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stop&lt;/td&gt;
&lt;td&gt;Shutdown, backup, start&lt;/td&gt;
&lt;td&gt;Minutes&lt;/td&gt;
&lt;td&gt;Best&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Recommendation&lt;/strong&gt;: Use snapshot mode. If you need perfect consistency, use application-level tools (database dumps, etc.) before backup.&lt;/p&gt;
&lt;h2&gt;Retention Policies&lt;/h2&gt;
&lt;p&gt;Retention determines how many backups to keep. PBS supports GFS (Grandfather-Father-Son):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Keep last:    3      # Always keep last 3 backups
Keep hourly:  24     # Keep 24 hourly backups
Keep daily:   7      # Keep 7 daily backups
Keep weekly:  4      # Keep 4 weekly backups
Keep monthly: 6      # Keep 6 monthly backups
Keep yearly:  2      # Keep 2 yearly backups
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This gives you:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Recent: Multiple restore points&lt;/li&gt;
&lt;li&gt;Medium-term: Daily granularity&lt;/li&gt;
&lt;li&gt;Long-term: Monthly/yearly for compliance&lt;/li&gt;
&lt;/ul&gt;
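&lt;p&gt;The example policy above retains at most:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Upper bound on retained restore points
# (categories can overlap, so the real count is usually lower)
total=$((3 + 24 + 7 + 4 + 6 + 2))
echo At most $total restore points
&lt;/code&gt;&lt;/pre&gt;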
&lt;h3&gt;Configure Retention on Proxmox&lt;/h3&gt;
&lt;p&gt;In backup job configuration:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Edit backup job
pvesh set /cluster/backup/&amp;lt;jobid&amp;gt; --prune-backups keep-last=3,keep-daily=7,keep-weekly=4,keep-monthly=6
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Prune Schedule on PBS&lt;/h3&gt;
&lt;p&gt;PBS runs pruning automatically. Check/configure:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On PBS
proxmox-backup-manager prune-job list
proxmox-backup-manager prune-job create prune-store1 --store store1 --schedule daily --keep-daily 7
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Verification&lt;/h2&gt;
&lt;p&gt;Backups without verification are hopes, not backups.&lt;/p&gt;
&lt;h3&gt;Automatic Verification&lt;/h3&gt;
&lt;p&gt;PBS can verify backups automatically:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On PBS - create verification job
proxmox-backup-manager verify-job create weekly-verify \
  --store store1 \
  --schedule &quot;weekly&quot; \
  --outdated-after 7
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This reads all backup chunks and verifies checksums. Catches:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Bit rot&lt;/li&gt;
&lt;li&gt;Storage corruption&lt;/li&gt;
&lt;li&gt;Incomplete backups&lt;/li&gt;
&lt;/ul&gt;
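&lt;p&gt;At its core, verification is &quot;re-read the data, re-hash it, compare against the recorded digest.&quot; A minimal sketch of that idea using coreutils &lt;code&gt;sha256sum&lt;/code&gt; (not the actual PBS chunk format):&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Re-hash a file and compare to a previously recorded digest.
# Returns nonzero if the data no longer matches (bit rot, truncation).
verify_file() {
  local file=$1 expected=$2
  [ "$(sha256sum "$file" | cut -d' ' -f1)" = "$expected" ]
}
```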
&lt;h3&gt;Manual Verification&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Verify all backups in a datastore (run on PBS)
proxmox-backup-manager verify store1

# Individual snapshots can be verified from the PBS web UI
# (Datastore → Content → Verify)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Restore Testing&lt;/h2&gt;
&lt;p&gt;Verification proves data integrity. Restore testing proves you can actually recover.&lt;/p&gt;
&lt;h3&gt;Schedule Regular Restore Tests&lt;/h3&gt;
&lt;p&gt;Monthly restore drill:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Pick a random VM backup&lt;/li&gt;
&lt;li&gt;Restore to temporary VM&lt;/li&gt;
&lt;li&gt;Boot it, verify it works&lt;/li&gt;
&lt;li&gt;Delete temporary VM&lt;/li&gt;
&lt;li&gt;Document the test&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;# Restore to new VM
qmrestore pbs-store:backup/vm/100/2025-01-08T01:00:00Z 900 --storage local-zfs

# Boot and verify
qm start 900

# After verification
qm stop 900
qm destroy 900
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Restore Test Checklist&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Date: 2025-01-08
Backup tested: vm/100/2025-01-08T01:00:00Z
Original VM: web-server (100)
Restored as: test-restore (900)

[ ] Restore completed without errors
[ ] VM boots successfully
[ ] Services start (nginx, database, etc.)
[ ] Application responds (curl localhost)
[ ] Data integrity (sample data check)
[ ] Time to restore: 5 minutes

Tested by: Admin
Result: PASS
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Encryption&lt;/h2&gt;
&lt;p&gt;For off-site backups, enable encryption:&lt;/p&gt;
&lt;h3&gt;Generate Key&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# On Proxmox
proxmox-backup-client key create /etc/pve/priv/backup-key.enc

# Protect the key!
cp /etc/pve/priv/backup-key.enc /secure/location/
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Configure Encrypted Backups&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Add PBS storage with encryption
pvesm add pbs pbs-encrypted \
  --server 10.0.0.50 \
  --datastore store1 \
  --username backup@pbs \
  --encryption-key /etc/pve/priv/backup-key.enc \
  --fingerprint &amp;lt;fingerprint&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Critical&lt;/strong&gt;: If you lose the encryption key, backups are unrecoverable. Store key securely, separately from backups.&lt;/p&gt;
&lt;h2&gt;Monitoring Backups&lt;/h2&gt;
&lt;h3&gt;Check Backup Status&lt;/h3&gt;
&lt;p&gt;On Proxmox:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# List recent backups
pvesh get /nodes/pve1/storage/pbs-store/content

# Check backup job status
pvesh get /cluster/backup
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;On PBS:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Datastore status
proxmox-backup-manager datastore list

# Recent backup tasks
proxmox-backup-manager task list
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Alerting&lt;/h3&gt;
&lt;p&gt;Configure email alerts on PBS:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Set an email on the user that should receive notifications,
# then point the datastore at that user
proxmox-backup-manager user update root@pam --email admin@example.com
proxmox-backup-manager datastore update store1 --notify-user root@pam
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Key alerts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Backup job failures&lt;/li&gt;
&lt;li&gt;Verification failures&lt;/li&gt;
&lt;li&gt;Datastore space warnings&lt;/li&gt;
&lt;li&gt;Pruning issues&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Disaster Scenarios&lt;/h2&gt;
&lt;h3&gt;Scenario: VM Accidentally Deleted&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# List available backups
pvesh get /nodes/pve1/storage/pbs-store/content --vmid 100

# Restore
qmrestore pbs-store:backup/vm/100/2025-01-08T01:00:00Z 100
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Recovery time: Minutes.&lt;/p&gt;
&lt;h3&gt;Scenario: Proxmox Host Failed&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# On new/rebuilt host, add PBS storage
pvesm add pbs pbs-store ...

# List available backups
pvesh get /nodes/pve2/storage/pbs-store/content

# Restore all VMs
pvesh get /nodes/pve2/storage/pbs-store/content --output-format json | \
  jq -r &apos;.[] | &quot;\(.volid) \(.vmid)&quot;&apos; | \
  while read volid vmid; do
    qmrestore &quot;$volid&quot; &quot;$vmid&quot;
  done
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Recovery time: Hours (depends on VM count and size).&lt;/p&gt;
&lt;h3&gt;Scenario: PBS Server Lost&lt;/h3&gt;
&lt;p&gt;This is why you need off-site copies:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# If you have replication to second PBS
# Use backup PBS as primary

# If no replication... restore from:
# - Off-site tape/cloud
# - Secondary backup system
# - Old-school vzdump files
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Off-Site Backups&lt;/h2&gt;
&lt;p&gt;PBS-to-PBS replication:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On source PBS, create sync job
proxmox-backup-manager sync-job create remote-sync \
  --store store1 \
  --remote pbs-remote \
  --remote-store offsite \
  --schedule &quot;daily&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This syncs deduplicated data to remote PBS. Only changed chunks transfer.&lt;/p&gt;
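&lt;p&gt;The &quot;only changed chunks&quot; behavior falls out of content addressing: a chunk&apos;s ID is its hash, so identical data maps to the same ID and never transfers twice. A toy illustration with &lt;code&gt;sha256sum&lt;/code&gt; standing in for the real chunk digest:&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Content addressing in miniature: the chunk ID is the hash of its data,
# so duplicate chunks collapse to one stored (and one transferred) object.
chunk_id() { printf '%s' "$1" | sha256sum | cut -d' ' -f1; }

# How many distinct chunks would actually be stored or synced?
unique_chunks() {
  for c in "$@"; do chunk_id "$c"; done | sort -u | wc -l | tr -d ' '
}
```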
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;A backup exists only after a successful restore test.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Having backup software running is step one. Having backups completing is step two. But the backup doesn&apos;t exist until you&apos;ve proven you can restore from it.&lt;/p&gt;
&lt;p&gt;The process:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Configure&lt;/strong&gt;: PBS, jobs, retention&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automate&lt;/strong&gt;: Scheduled backups, verification&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Test&lt;/strong&gt;: Monthly restore drills&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Document&lt;/strong&gt;: What was tested, what was the result&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Improve&lt;/strong&gt;: Fix issues found during tests&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The companies that recover from disasters aren&apos;t the ones with the best backup software. They&apos;re the ones who practiced recovery before they needed it.&lt;/p&gt;
</content:encoded><category>proxmox</category><category>backup</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Cluster Setup: Joining Nodes, Quorum, and Corosync Realities</title><link>https://ashimov.com/posts/proxmox-cluster/</link><guid isPermaLink="true">https://ashimov.com/posts/proxmox-cluster/</guid><description>Building a Proxmox cluster correctly. Covers node joining, quorum mechanics, split-brain prevention, Corosync networking, and why clustering is network discipline, not just a button.</description><pubDate>Fri, 22 Aug 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A Proxmox cluster looks simple: join nodes, share configuration, migrate VMs between them. Click a button, cluster created. The web UI makes it seem like magic.&lt;/p&gt;
&lt;p&gt;It&apos;s not magic. It&apos;s distributed systems, and distributed systems fail in ways that single nodes don&apos;t. Split-brain scenarios, quorum loss, network partitions — these aren&apos;t theoretical. They happen, and when they do, your VMs stop or corrupt.&lt;/p&gt;
&lt;p&gt;Clustering is not a button. It&apos;s network discipline and failure planning.&lt;/p&gt;
&lt;h2&gt;What a Cluster Actually Is&lt;/h2&gt;
&lt;p&gt;A Proxmox cluster is:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Shared configuration&lt;/strong&gt;: &lt;code&gt;/etc/pve&lt;/code&gt; is replicated across all nodes via pmxcfs (a cluster filesystem)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Corosync&lt;/strong&gt;: Cluster communication layer handling membership and messaging&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Quorum&lt;/strong&gt;: Voting system to prevent split-brain&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Optional&lt;/strong&gt;: Shared storage, HA, live migration&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;┌─────────────┐    Corosync    ┌─────────────┐
│    Node 1   │◄──────────────►│    Node 2   │
│  (vote: 1)  │                │  (vote: 1)  │
└──────┬──────┘                └──────┬──────┘
       │                              │
       │         Corosync             │
       │◄────────────────────────────►│
       │                              │
       ▼                              ▼
┌─────────────┐
│    Node 3   │
│  (vote: 1)  │
└─────────────┘

Quorum: 2 of 3 votes required (majority)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Before You Cluster&lt;/h2&gt;
&lt;h3&gt;Network Requirements&lt;/h3&gt;
&lt;p&gt;Corosync needs reliable, low-latency networking:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dedicated network recommended&lt;/strong&gt;: Separate from VM traffic&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Same subnet recommended&lt;/strong&gt;: Keep all Corosync links in one L2 segment; knet can route, but every hop adds latency and failure modes&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Low latency&lt;/strong&gt;: Under 2ms round-trip ideally&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Redundant links&lt;/strong&gt;: For production, use bonding or multiple Corosync rings&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Bad ideas:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Corosync over WAN (latency kills it)&lt;/li&gt;
&lt;li&gt;Corosync over congested VM network&lt;/li&gt;
&lt;li&gt;Single network link (any failure = cluster issues)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Hostname and DNS&lt;/h3&gt;
&lt;p&gt;Before clustering, every node needs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Correct hostname
hostnamectl set-hostname pve1.lab.local

# /etc/hosts must resolve all cluster nodes
cat /etc/hosts
127.0.0.1 localhost
10.0.0.10 pve1.lab.local pve1
10.0.0.11 pve2.lab.local pve2
10.0.0.12 pve3.lab.local pve3
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Critical&lt;/strong&gt;: Hostnames cannot change after clustering. Get them right now.&lt;/p&gt;
&lt;h3&gt;Time Synchronization&lt;/h3&gt;
&lt;p&gt;All nodes must have synchronized time:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check time sync
timedatectl status

# Should show &quot;System clock synchronized: yes&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Time drift causes certificate issues and cluster instability.&lt;/p&gt;
&lt;h2&gt;Creating the Cluster&lt;/h2&gt;
&lt;h3&gt;On First Node (pve1)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Create cluster
pvecm create my-cluster

# Verify
pvecm status
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Output shows:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Cluster information
-------------------
Name:             my-cluster
Config Version:   1
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             ...
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000001
Ring ID:          1.5
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   1
Highest expected: 1
Total votes:      1
Quorum:           1
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Joining Additional Nodes (pve2, pve3)&lt;/h3&gt;
&lt;p&gt;From each node to join:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Join cluster (run on pve2)
pvecm add 10.0.0.10

# Enter root password for pve1 when prompted
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After joining:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check status from any node
pvecm status

# Should show all nodes
pvecm nodes
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Quorum: Why It Matters&lt;/h2&gt;
&lt;p&gt;Quorum prevents split-brain — where two halves of a cluster both think they&apos;re in charge, making conflicting decisions.&lt;/p&gt;
&lt;h3&gt;How Quorum Works&lt;/h3&gt;
&lt;p&gt;Each node has votes (default: 1). Quorum requires majority:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Nodes&lt;/th&gt;
&lt;th&gt;Votes&lt;/th&gt;
&lt;th&gt;Quorum Needed&lt;/th&gt;
&lt;th&gt;Can Lose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0 nodes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0 nodes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1 node&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;1 node&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2 nodes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
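&lt;p&gt;The table is just strict-majority arithmetic, which you can sanity-check in shell:&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Quorum is a strict majority: floor(votes/2) + 1.
quorum_needed() { echo $(( $1 / 2 + 1 )); }

# Tolerable failures = total votes minus the quorum threshold.
nodes_can_lose() { echo $(( $1 - ($1 / 2 + 1) )); }
```

&lt;p&gt;Note that 4 nodes tolerate no more failures than 3 — even node counts buy nothing for quorum.&lt;/p&gt;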
&lt;p&gt;&lt;strong&gt;Two-node problem&lt;/strong&gt;: With 2 nodes, losing either means no quorum. Both nodes freeze.&lt;/p&gt;
&lt;h3&gt;Two-Node Cluster Solutions&lt;/h3&gt;
&lt;p&gt;Option 1: &lt;strong&gt;QDevice (recommended)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;External quorum device provides tie-breaking vote:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On a separate lightweight VM/LXC (not on cluster nodes!)
apt install corosync-qnetd

# On each cluster node
apt install corosync-qdevice
pvecm qdevice setup 10.0.0.100  # QDevice IP
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now you have 2 nodes + 1 QDevice = 3 votes. Can survive 1 node failure.&lt;/p&gt;
&lt;p&gt;Option 2: &lt;strong&gt;Expected votes override (dangerous)&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On surviving node during split
pvecm expected 1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This tells the node &quot;expect only 1 vote for quorum.&quot; &lt;strong&gt;Dangerous&lt;/strong&gt; — only use when you&apos;re certain the other node is truly dead.&lt;/p&gt;
&lt;h3&gt;Checking Quorum Status&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Detailed quorum info
pvecm status

# Is cluster quorate?
pvecm status | grep Quorate
# Quorate: Yes  (means cluster can operate)
# Quorate: No   (means cluster is frozen)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Corosync Configuration&lt;/h2&gt;
&lt;h3&gt;View Current Config&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;cat /etc/pve/corosync.conf
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Redundant Corosync Links&lt;/h3&gt;
&lt;p&gt;For production, use multiple networks:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# View current links
pvecm status

# Specify both links when creating the cluster
pvecm create my-cluster --link0 10.0.0.10 --link1 10.10.0.10

# And when joining
pvecm add 10.0.0.10 --link0 10.0.0.11 --link1 10.10.0.11

# For an existing cluster: add ring1_addr entries in
# /etc/pve/corosync.conf and bump config_version
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now Corosync uses both networks. If one fails, the other maintains cluster.&lt;/p&gt;
&lt;h3&gt;Network Interface Configuration&lt;/h3&gt;
&lt;p&gt;Ensure Corosync interfaces are correctly configured:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check which interfaces Corosync uses
corosync-cfgtool -s

# Should show ring status for each link
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Common Cluster Operations&lt;/h2&gt;
&lt;h3&gt;Node Maintenance&lt;/h3&gt;
&lt;p&gt;Before working on a node:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Migrate all VMs off the node
# Via Web UI or:
for vmid in $(qm list | awk &apos;NR&amp;gt;1 {print $1}&apos;); do
    qm migrate $vmid pve2 --online
done

# If using HA, disable it temporarily
ha-manager set vm:100 --state disabled
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Removing a Node&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# On node being removed - stop cluster services
systemctl stop pve-cluster corosync

# On remaining node
pvecm delnode pve3

# On removed node - clean up
rm -rf /etc/pve/nodes/pve3
rm /etc/corosync/*
rm /var/lib/corosync/*
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Adding Node Back After Removal&lt;/h3&gt;
&lt;p&gt;The node must be completely clean:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On the node to re-add
systemctl stop pve-cluster corosync
rm -rf /etc/pve/*
rm -rf /etc/corosync/*
rm -rf /var/lib/corosync/*

# Then join fresh
pvecm add 10.0.0.10
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Split-Brain Scenarios&lt;/h2&gt;
&lt;h3&gt;What Happens&lt;/h3&gt;
&lt;p&gt;Network partition between nodes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;┌─────────┐         X         ┌─────────┐
│  pve1   │─────────X─────────│  pve2   │
│ (alone) │         X         │ (alone) │
└─────────┘   (network cut)   └─────────┘

Both nodes think: &quot;Is the other dead, or just unreachable?&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Without quorum:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Neither can be sure the other is truly dead&lt;/li&gt;
&lt;li&gt;Both freeze rather than risk conflicting operations&lt;/li&gt;
&lt;li&gt;VMs stop (better than corruption)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With quorum (3+ nodes or QDevice):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Majority side continues operating&lt;/li&gt;
&lt;li&gt;Minority side freezes&lt;/li&gt;
&lt;li&gt;Clear decision, no conflict&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Recovering from Split-Brain&lt;/h3&gt;
&lt;p&gt;If both sides made changes (shouldn&apos;t happen with proper quorum):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check pmxcfs status
cat /etc/pve/.members

# Force resync (DANGEROUS - data loss possible)
systemctl stop pve-cluster
pmxcfs -l  # Local mode
# Review /etc/pve, fix conflicts manually
systemctl start pve-cluster
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is why you prevent split-brain rather than recover from it.&lt;/p&gt;
&lt;h2&gt;Troubleshooting&lt;/h2&gt;
&lt;h3&gt;Cluster Won&apos;t Form&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check Corosync status
systemctl status corosync

# Check logs
journalctl -u corosync -f

# Common issues:
# - Firewall blocking ports 5405-5412/udp
# - Hostname mismatch
# - Time drift
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Node Shows as Offline&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check from &quot;offline&quot; node
pvecm status

# Check network connectivity
ping pve1
ping pve2

# Check Corosync communication
corosync-cfgtool -s
# Ring should show &quot;no faults&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;&quot;Cluster not quorate&quot; Error&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check how many nodes are visible
pvecm nodes

# If nodes are missing, check network
# If all nodes present but not quorate, check vote count
pvecm status | grep -E &quot;Expected|Total&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Network Design for Clusters&lt;/h2&gt;
&lt;h3&gt;Minimum (Lab)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;                    ┌─────────────┐
All traffic ───────►│   Switch    │
                    └──────┬──────┘
              ┌────────────┼────────────┐
              ▼            ▼            ▼
           pve1         pve2         pve3
        10.0.0.10    10.0.0.11    10.0.0.12
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Single network for everything. Works, but any network issue affects cluster.&lt;/p&gt;
&lt;h3&gt;Recommended (Production)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Corosync Network (dedicated)
          ┌─────────────┐
          │  Switch A   │
          └──────┬──────┘
    ┌────────────┼────────────┐
    ▼            ▼            ▼
 pve1         pve2         pve3
10.10.0.10  10.10.0.11  10.10.0.12

Management + VM Network
          ┌─────────────┐
          │  Switch B   │
          └──────┬──────┘
    ┌────────────┼────────────┐
    ▼            ▼            ▼
 pve1         pve2         pve3
10.0.0.10   10.0.0.11   10.0.0.12
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Separate networks. Corosync traffic isolated from VM traffic.&lt;/p&gt;
&lt;h3&gt;Best (Production + Redundancy)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Corosync Ring 0          Corosync Ring 1
    Switch A                 Switch B
       │                        │
   ┌───┼───┐                ┌───┼───┐
   ▼   ▼   ▼                ▼   ▼   ▼
 pve1 pve2 pve3           pve1 pve2 pve3

Both rings active. Either can fail without cluster impact.
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;A cluster is not a button. It&apos;s network discipline and failure planning.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Clicking &quot;Create Cluster&quot; is the easy part. The hard part is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Network reliability (Corosync needs it)&lt;/li&gt;
&lt;li&gt;Quorum planning (how many nodes can you lose?)&lt;/li&gt;
&lt;li&gt;Split-brain prevention (QDevice for 2 nodes)&lt;/li&gt;
&lt;li&gt;Failure testing (does it actually fail over?)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A cluster that hasn&apos;t been failure-tested is a cluster that will surprise you. Test node failures. Test network partitions. Know what happens before production depends on it.&lt;/p&gt;
&lt;p&gt;The goal isn&apos;t &quot;we have a cluster.&quot; The goal is &quot;we understand how our cluster fails and have planned for it.&quot;&lt;/p&gt;
</content:encoded><category>proxmox</category><category>ha</category><category>virtualization</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>LXC vs VM: When Containers Are a Gift (and When They Bite)</title><link>https://ashimov.com/posts/proxmox-lxc-vs-vm/</link><guid isPermaLink="true">https://ashimov.com/posts/proxmox-lxc-vs-vm/</guid><description>Practical guide to choosing between LXC containers and VMs on Proxmox. Covers performance differences, security boundaries, use cases, and why containers offer speed but not always isolation.</description><pubDate>Tue, 19 Aug 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Proxmox gives you two virtualization options: KVM virtual machines and LXC containers. Both run workloads. Both appear as separate systems. But under the hood, they&apos;re fundamentally different — and that difference matters more than most people realize.&lt;/p&gt;
&lt;p&gt;Containers are fast. Boot in seconds, minimal overhead, efficient resource use. VMs are slower to start, use more memory, but provide real isolation.&lt;/p&gt;
&lt;p&gt;The question isn&apos;t &quot;which is better&quot; — it&apos;s &quot;which is appropriate.&quot; And getting that wrong means either wasting resources or creating security problems.&lt;/p&gt;
&lt;h2&gt;How They Actually Work&lt;/h2&gt;
&lt;h3&gt;KVM Virtual Machines&lt;/h3&gt;
&lt;p&gt;Each VM runs its own kernel. Complete isolation from the host.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Host Kernel (Proxmox)
     │
     └── QEMU/KVM Hypervisor
              │
              ├── VM1 (Linux kernel) ─── Processes
              ├── VM2 (Windows kernel) ── Processes
              └── VM3 (Linux kernel) ─── Processes
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The hypervisor virtualizes hardware. Each VM thinks it has its own CPU, memory, disk. Complete isolation — a bug in VM1&apos;s kernel can&apos;t affect VM2.&lt;/p&gt;
&lt;h3&gt;LXC Containers&lt;/h3&gt;
&lt;p&gt;All containers share the host kernel. Isolation via namespaces and cgroups.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Host Kernel (Proxmox)
     │
     ├── Container 1 (namespace) ─── Processes
     ├── Container 2 (namespace) ─── Processes
     └── Container 3 (namespace) ─── Processes
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Containers are isolated userspace instances. They share the host kernel, just in different namespaces. Faster, lighter — but a kernel vulnerability affects everything.&lt;/p&gt;
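&lt;p&gt;You can see this shared-kernel model from any Linux shell: namespaces are ordinary kernel objects listed under &lt;code&gt;/proc&lt;/code&gt;, and a container is just a process whose namespace IDs differ from the host:&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Print the namespace ID this process belongs to for a given namespace type
# (pid, net, uts, mnt, ...). Two processes in the same container report the
# same IDs; the host reports different ones. Linux only.
ns_id() { readlink /proc/self/ns/"$1"; }
```

&lt;p&gt;Compare &lt;code&gt;readlink /proc/1/ns/pid&lt;/code&gt; on the host with the same path inside a container to see the isolation directly.&lt;/p&gt;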
&lt;h2&gt;Performance Comparison&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;LXC Container&lt;/th&gt;
&lt;th&gt;KVM VM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Boot time&lt;/td&gt;
&lt;td&gt;1-5 seconds&lt;/td&gt;
&lt;td&gt;15-60 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory overhead&lt;/td&gt;
&lt;td&gt;~10-50MB&lt;/td&gt;
&lt;td&gt;~200-500MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU overhead&lt;/td&gt;
&lt;td&gt;Near zero&lt;/td&gt;
&lt;td&gt;2-5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disk I/O&lt;/td&gt;
&lt;td&gt;Native speed&lt;/td&gt;
&lt;td&gt;Near-native (virtio)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Network I/O&lt;/td&gt;
&lt;td&gt;Native speed&lt;/td&gt;
&lt;td&gt;Near-native (virtio)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Density&lt;/td&gt;
&lt;td&gt;50-100+ per host&lt;/td&gt;
&lt;td&gt;10-30 per host&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For equivalent workload, containers use significantly fewer resources.&lt;/p&gt;
&lt;h2&gt;Creating Containers&lt;/h2&gt;
&lt;h3&gt;Download Template&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# List available templates
pveam available

# Download Ubuntu template
pveam download local ubuntu-24.04-standard_24.04-2_amd64.tar.zst
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or via Web UI: select the node&apos;s local storage → CT Templates → Templates&lt;/p&gt;
&lt;h3&gt;Create Container&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;pct create 200 local:vztmpl/ubuntu-24.04-standard_24.04-2_amd64.tar.zst \
  --hostname web-container \
  --memory 1024 \
  --cores 2 \
  --net0 name=eth0,bridge=vmbr0,ip=10.0.0.200/24,gw=10.0.0.1 \
  --storage local-zfs \
  --rootfs local-zfs:8 \
  --password &quot;temporary&quot; \
  --unprivileged 1

# Start container
pct start 200

# Enter container
pct enter 200
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Via Web UI: Create CT → follow wizard.&lt;/p&gt;
&lt;h3&gt;Unprivileged vs Privileged&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Unprivileged (default, recommended):&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pct create 200 ... --unprivileged 1
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;Container root (UID 0) maps to unprivileged user on host (UID 100000+)&lt;/li&gt;
&lt;li&gt;Even if container is compromised, attacker can&apos;t escalate to host root&lt;/li&gt;
&lt;li&gt;Some things don&apos;t work (NFS mounts, raw disk access)&lt;/li&gt;
&lt;/ul&gt;
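&lt;p&gt;The mapping is a fixed offset — 100000 by default, taken from &lt;code&gt;/etc/subuid&lt;/code&gt; — so host-side UIDs are easy to predict:&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Default PVE unprivileged mapping: container UID N appears on the host
# as 100000 + N. (100000 is the stock offset; check /etc/subuid.)
host_uid() { echo $(( 100000 + $1 )); }
```

&lt;p&gt;This is why files created by a container&apos;s &lt;code&gt;www-data&lt;/code&gt; (UID 33) show up as UID 100033 on the host filesystem.&lt;/p&gt;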
&lt;p&gt;&lt;strong&gt;Privileged:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pct create 200 ... --unprivileged 0
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;Container root is host root (UID 0)&lt;/li&gt;
&lt;li&gt;Container escape = host root access&lt;/li&gt;
&lt;li&gt;Needed for: NFS, some filesystems, hardware passthrough&lt;/li&gt;
&lt;li&gt;Use only when necessary, with additional security&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Security Boundaries&lt;/h2&gt;
&lt;p&gt;This is where the choice matters most.&lt;/p&gt;
&lt;h3&gt;Container Security Reality&lt;/h3&gt;
&lt;p&gt;Containers share the kernel. This means:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Kernel vulnerability
     ↓
Affects host AND all containers
     ↓
Container escape possible
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Real examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dirty COW (CVE-2016-5195)&lt;/strong&gt;: Write to read-only memory. Container escape.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dirty Pipe (CVE-2022-0847)&lt;/strong&gt;: Overwrite files. Container escape.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Various cgroup escapes&lt;/strong&gt;: Break out of isolation.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Containers are NOT a security boundary against malicious actors. They&apos;re a convenience boundary for trusted workloads.&lt;/p&gt;
&lt;h3&gt;VM Security Reality&lt;/h3&gt;
&lt;p&gt;VMs have their own kernel. Attacker must escape hypervisor:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Guest kernel vulnerability
     ↓
Only affects that VM
     ↓
Hypervisor escape required for host access
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Hypervisor escapes exist (Spectre, Meltdown, VENOM) but are rarer and usually patched quickly. VMs are a real security boundary.&lt;/p&gt;
&lt;h3&gt;When Security Matters&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Use VMs when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Running untrusted code&lt;/li&gt;
&lt;li&gt;Multi-tenant (different customers)&lt;/li&gt;
&lt;li&gt;Security-critical workloads&lt;/li&gt;
&lt;li&gt;Compliance requirements (PCI-DSS, HIPAA often require VMs)&lt;/li&gt;
&lt;li&gt;Windows workloads&lt;/li&gt;
&lt;li&gt;Different OS requirements&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Use containers when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Single-tenant (all your own workloads)&lt;/li&gt;
&lt;li&gt;Trusted code only&lt;/li&gt;
&lt;li&gt;Resource efficiency matters more than perfect isolation&lt;/li&gt;
&lt;li&gt;Linux-only workloads&lt;/li&gt;
&lt;li&gt;Development environments&lt;/li&gt;
&lt;/ul&gt;
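&lt;p&gt;Condensed into a decision rule — a deliberate simplification to three coarse inputs (OS, trust, tenancy):&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Rule of thumb from the lists above: container only when the workload is
# Linux, trusted, and single-tenant; everything else gets a VM.
# Inputs: os (linux|windows|other), trusted (yes|no), tenancy (single|multi)
choose_runtime() {
  case "$1/$2/$3" in
    linux/yes/single) echo container ;;
    *)                echo vm ;;
  esac
}
```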
&lt;h2&gt;Practical Use Cases&lt;/h2&gt;
&lt;h3&gt;Container-Appropriate&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Pi-hole DNS:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pct create 201 local:vztmpl/debian-12-standard_12.5-1_amd64.tar.zst \
  --hostname pihole \
  --memory 512 \
  --cores 1 \
  --net0 name=eth0,bridge=vmbr0,ip=10.0.0.53/24,gw=10.0.0.1 \
  --unprivileged 1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;DNS is trusted, internal, lightweight. Perfect for container.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Reverse proxy (nginx/traefik):&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pct create 202 ... --hostname proxy --memory 256
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Forwards traffic, minimal state. Container is ideal.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Internal monitoring (Prometheus, Grafana):&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Internal tools, trusted environment. Containers save resources.&lt;/p&gt;
&lt;h3&gt;VM-Appropriate&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Database server:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;qm create 300 --name db-server --memory 8192 --cores 4 ...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Critical data. If something breaks, don&apos;t risk it affecting other workloads.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Customer-facing web application:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Untrusted input from internet. VM provides real isolation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Windows anything:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;qm create 301 --name windows-server --ostype win11 ...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Windows doesn&apos;t run in LXC. VMs only.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Kubernetes nodes:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Docker-in-LXC works but is fragile. VMs are more reliable for k8s.&lt;/p&gt;
&lt;h2&gt;Container Features&lt;/h2&gt;
&lt;h3&gt;Bind Mounts&lt;/h3&gt;
&lt;p&gt;Share host directories with container:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pct set 200 --mp0 /data/shared,mp=/mnt/shared
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Useful for config management, but widens attack surface.&lt;/p&gt;
&lt;h3&gt;Device Passthrough&lt;/h3&gt;
&lt;p&gt;For hardware access (requires privileged or specific config):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# GPU passthrough (NVIDIA character devices use major number 195)
echo &quot;lxc.cgroup2.devices.allow: c 195:* rwm&quot; &amp;gt;&amp;gt; /etc/pve/lxc/200.conf
echo &quot;lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file&quot; &amp;gt;&amp;gt; /etc/pve/lxc/200.conf
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Complex and fragile. Consider VM for hardware passthrough.&lt;/p&gt;
&lt;h3&gt;Nesting (Docker in LXC)&lt;/h3&gt;
&lt;p&gt;Run Docker inside container:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pct set 200 --features nesting=1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Works for simple cases. For production Docker workloads, use VMs.&lt;/p&gt;
&lt;h2&gt;Resource Limits&lt;/h2&gt;
&lt;h3&gt;Container Limits&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Memory (hard limit)
pct set 200 --memory 2048

# CPU cores
pct set 200 --cores 2

# CPU limit (in cores; fractional values allowed)
pct set 200 --cpulimit 1.5  # At most 1.5 cores worth of CPU time

# Disk quota
pct resize 200 rootfs 20G
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;VM Limits&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Memory
qm set 100 --memory 4096

# Ballooning (dynamic memory)
qm set 100 --balloon 2048  # Minimum, can grow to --memory

# CPU
qm set 100 --cores 4 --sockets 1

# CPU type
qm set 100 --cpu host  # Pass through host CPU features
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Migration&lt;/h2&gt;
&lt;h3&gt;Container Migration&lt;/h3&gt;
&lt;p&gt;Fast — only filesystem moves:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Offline (stopped)
pct migrate 200 pve2

# Online (running) - requires shared storage
pct migrate 200 pve2 --online
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;VM Migration&lt;/h3&gt;
&lt;p&gt;Slower — memory state must transfer:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Offline
qm migrate 100 pve2

# Live migration - requires shared storage
qm migrate 100 pve2 --online
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Backup and Restore&lt;/h2&gt;
&lt;p&gt;Both work similarly:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Backup container
vzdump 200 --storage backup --mode snapshot

# Backup VM
vzdump 100 --storage backup --mode snapshot

# Restore
pct restore 200 /backup/vzdump-lxc-200-*.tar.zst
qmrestore /backup/vzdump-qemu-100-*.vma.zst 100
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Containers back up faster (smaller, no memory state).&lt;/p&gt;
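&lt;p&gt;Backup archives follow the vzdump naming scheme (&lt;code&gt;vzdump-TYPE-VMID-TIMESTAMP.ext&lt;/code&gt;), which makes the newest archive per guest easy to find programmatically. A small sketch, assuming that standard naming:&lt;/p&gt;

```python
# Sketch: pick the newest vzdump archive per guest from a list of filenames.
# Assumes standard vzdump naming: vzdump-TYPE-VMID-YYYY_MM_DD-HH_MM_SS.ext
import re

PATTERN = re.compile(r"vzdump-(qemu|lxc)-(\d+)-(\d{4}_\d{2}_\d{2}-\d{2}_\d{2}_\d{2})")

def latest_backups(filenames):
    newest = {}
    for name in filenames:
        m = PATTERN.search(name)
        if not m:
            continue
        vmid, stamp = int(m.group(2)), m.group(3)
        # Timestamps sort lexicographically (YYYY_MM_DD-HH_MM_SS)
        if vmid not in newest or stamp > newest[vmid][1]:
            newest[vmid] = (name, stamp)
    return {vmid: name for vmid, (name, _) in newest.items()}
```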
&lt;h2&gt;Hybrid Approach&lt;/h2&gt;
&lt;p&gt;In practice, use both:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Production Layout:
├── LXC Containers (internal services)
│   ├── 200: pihole (DNS)
│   ├── 201: nginx (reverse proxy)
│   ├── 202: prometheus (monitoring)
│   └── 203: grafana (dashboards)
│
└── VMs (security-sensitive)
    ├── 100: web-app (internet-facing)
    ├── 101: database (critical data)
    ├── 102: backup-server (recovery)
    └── 103: windows-dc (Active Directory)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Containers for internal, trusted, lightweight workloads.
VMs for anything external, critical, or Windows.&lt;/p&gt;
&lt;h2&gt;Troubleshooting&lt;/h2&gt;
&lt;h3&gt;Container Won&apos;t Start&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check logs
pct start 200 --debug

# Common issues:
# - AppArmor blocking: check /var/log/kern.log
# - Disk full: check storage
# - Network collision: check IP conflicts
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Container Networking Issues&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Enter container
pct enter 200

# Check interface
ip a

# Check gateway
ip route

# From host, check the bridge (brctl is deprecated)
bridge link show master vmbr0
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Unprivileged Container Limitations&lt;/h3&gt;
&lt;p&gt;If something fails in an unprivileged container:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Try enabling features
pct set 200 --features nesting=1,keyctl=1

# If still fails, might need privileged
# Consider: is this really appropriate for a container?
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Containers buy you speed, but not always isolation.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The temptation is to use containers everywhere — they&apos;re faster, lighter, easier. But containers share a kernel. That kernel is your security boundary. A container escape becomes a host compromise.&lt;/p&gt;
&lt;p&gt;When to contain:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Trusted workloads&lt;/li&gt;
&lt;li&gt;Single-owner environment&lt;/li&gt;
&lt;li&gt;Resource efficiency priority&lt;/li&gt;
&lt;li&gt;Linux services&lt;/li&gt;
&lt;li&gt;Internal tools&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When to virtualize:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Untrusted inputs&lt;/li&gt;
&lt;li&gt;Multi-tenant&lt;/li&gt;
&lt;li&gt;Security-critical&lt;/li&gt;
&lt;li&gt;Windows&lt;/li&gt;
&lt;li&gt;Compliance requirements&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The hybrid approach works best: containers for the lightweight stuff, VMs for what matters. Don&apos;t let container efficiency seduce you into container-izing everything. Sometimes the VM overhead is the security boundary you need.&lt;/p&gt;
</content:encoded><category>proxmox</category><category>security</category><category>virtualization</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Templates &amp; Cloud-Init: Faster VMs Without Chaos</title><link>https://ashimov.com/posts/proxmox-templates/</link><guid isPermaLink="true">https://ashimov.com/posts/proxmox-templates/</guid><description>Creating and using VM templates with cloud-init on Proxmox. Covers template creation workflow, cloud-init configuration, customization, and why a template is a contract that must stay stable.</description><pubDate>Fri, 15 Aug 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Installing an OS from ISO takes 15-30 minutes. Do that for every VM and you&apos;ll spend more time waiting for installers than doing actual work. Templates solve this: install once, clone many times. A new VM in seconds instead of minutes.&lt;/p&gt;
&lt;p&gt;But templates have a hidden cost. When the template changes, everything cloned from it is different. When the template is inconsistent, every VM is a surprise. The template is a contract — if it floats, everything downstream breaks.&lt;/p&gt;
&lt;p&gt;This is how to build templates that work reliably.&lt;/p&gt;
&lt;h2&gt;The Basic Workflow&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Install OS from ISO (once)&lt;/li&gt;
&lt;li&gt;Configure base system (packages, settings)&lt;/li&gt;
&lt;li&gt;Add cloud-init for per-VM customization&lt;/li&gt;
&lt;li&gt;Convert to template&lt;/li&gt;
&lt;li&gt;Clone template for new VMs&lt;/li&gt;
&lt;li&gt;Cloud-init configures hostname, network, SSH keys&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   ISO       │───▶│  Base VM    │───▶│  Template   │
│   Install   │    │  Configure  │    │  (frozen)   │
└─────────────┘    └─────────────┘    └──────┬──────┘
                                             │
                   ┌─────────────────────────┼─────────────────────────┐
                   ▼                         ▼                         ▼
            ┌─────────────┐           ┌─────────────┐           ┌─────────────┐
            │   Clone 1   │           │   Clone 2   │           │   Clone 3   │
            │  web-server │           │  db-server  │           │  app-server │
            └─────────────┘           └─────────────┘           └─────────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Creating a Base VM&lt;/h2&gt;
&lt;p&gt;Start with minimal install:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Download cloud image (Ubuntu example)
wget https://cloud-images.ubuntu.com/noble/current/noble-server-cloudimg-amd64.img

# Create VM
qm create 9000 --name &quot;ubuntu-24.04-template&quot; --memory 2048 --cores 2 --net0 virtio,bridge=vmbr0

# Import cloud image as disk
qm importdisk 9000 noble-server-cloudimg-amd64.img local-zfs

# Attach disk
qm set 9000 --scsihw virtio-scsi-pci --scsi0 local-zfs:vm-9000-disk-0

# Add cloud-init drive
qm set 9000 --ide2 local-zfs:cloudinit

# Set boot order
qm set 9000 --boot c --bootdisk scsi0

# Enable serial console (for cloud-init)
qm set 9000 --serial0 socket --vga serial0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or via Web UI:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create VM with name like &lt;code&gt;ubuntu-2404-template&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;OS: Do not use any media (we&apos;ll import cloud image)&lt;/li&gt;
&lt;li&gt;System: SCSI Controller = VirtIO SCSI&lt;/li&gt;
&lt;li&gt;Disks: Delete default disk&lt;/li&gt;
&lt;li&gt;After creation: Hardware → Add → CloudInit Drive&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Using ISO Instead of Cloud Image&lt;/h3&gt;
&lt;p&gt;If you prefer traditional install:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create VM with ISO
qm create 9001 --name &quot;debian-12-template&quot; --memory 2048 --cores 2 --cdrom local:iso/debian-12.iso --net0 virtio,bridge=vmbr0

# Add disk
qm set 9001 --scsihw virtio-scsi-pci --scsi0 local-zfs:32

# Start and install via console
qm start 9001
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Install OS, then prepare for templating (see next section).&lt;/p&gt;
&lt;h2&gt;Preparing for Template&lt;/h2&gt;
&lt;p&gt;Before converting to template, clean up the VM:&lt;/p&gt;
&lt;h3&gt;On Debian/Ubuntu&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Update everything
apt update &amp;amp;&amp;amp; apt full-upgrade -y

# Install cloud-init and QEMU guest agent
apt install -y cloud-init qemu-guest-agent

# Enable guest agent
systemctl enable qemu-guest-agent

# Clean package cache
apt clean
apt autoremove -y

# Remove machine-specific data
truncate -s 0 /etc/machine-id
rm -f /var/lib/dbus/machine-id

# Remove SSH host keys (regenerate on first boot)
rm -f /etc/ssh/ssh_host_*

# Remove cloud-init state (so it runs fresh on clone)
cloud-init clean

# Clear logs
journalctl --rotate
journalctl --vacuum-time=1s
rm -rf /var/log/*.log
rm -rf /var/log/*.gz

# Clear bash history
history -c
rm -f /root/.bash_history
rm -f /home/*/.bash_history

# Shutdown
shutdown -h now
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;On RHEL/AlmaLinux/Rocky&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Update
dnf update -y

# Install cloud-init and guest agent
dnf install -y cloud-init qemu-guest-agent

# Enable services
systemctl enable qemu-guest-agent
systemctl enable cloud-init

# Clean
dnf clean all
rm -rf /var/cache/dnf/*

# Same cleanup as Debian
rm -f /etc/machine-id
rm -f /etc/ssh/ssh_host_*
cloud-init clean
# ... etc
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Cloud-Init Configuration&lt;/h2&gt;
&lt;p&gt;Cloud-init reads metadata at boot and configures the VM. Proxmox provides this via a special drive.&lt;/p&gt;
&lt;h3&gt;Proxmox Cloud-Init Settings&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Set cloud-init options
qm set 9000 --ciuser admin
qm set 9000 --cipassword &apos;temporary-password&apos;
qm set 9000 --sshkeys ~/.ssh/id_ed25519.pub
qm set 9000 --ipconfig0 ip=dhcp
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or via Web UI: VM → Cloud-Init tab:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;User: admin&lt;/li&gt;
&lt;li&gt;Password: (set or leave empty for SSH-only)&lt;/li&gt;
&lt;li&gt;SSH public key: paste your key&lt;/li&gt;
&lt;li&gt;IP Config: DHCP or static&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Custom Cloud-Init&lt;/h3&gt;
&lt;p&gt;For advanced configuration, use snippets:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create snippets storage if needed
pvesm add dir snippets --path /var/lib/vz/snippets --content snippets

# Create custom cloud-init config
cat &amp;gt; /var/lib/vz/snippets/custom-user-data.yaml &amp;lt;&amp;lt; &apos;EOF&apos;
#cloud-config
package_update: true
package_upgrade: true
packages:
  - vim
  - htop
  - curl
  - git

users:
  - name: admin
    sudo: ALL=(ALL) NOPASSWD:ALL
    shell: /bin/bash
    ssh_authorized_keys:
      - ssh-ed25519 AAAA... your-key

write_files:
  - path: /etc/motd
    content: |
      Welcome to the VM
      Provisioned by cloud-init

runcmd:
  - systemctl enable --now qemu-guest-agent
  - timedatectl set-timezone UTC
EOF
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Apply to VM:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;qm set 9000 --cicustom &quot;user=snippets:snippets/custom-user-data.yaml&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Converting to Template&lt;/h2&gt;
&lt;p&gt;Once the VM is prepared:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Convert to template
qm template 9000
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or via Web UI: Right-click VM → Convert to template&lt;/p&gt;
&lt;p&gt;The VM icon changes to indicate it&apos;s a template. Templates cannot be started — only cloned.&lt;/p&gt;
&lt;h2&gt;Cloning VMs&lt;/h2&gt;
&lt;h3&gt;Full Clone&lt;/h3&gt;
&lt;p&gt;Creates independent copy. Disk is duplicated.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;qm clone 9000 100 --name &quot;web-server&quot; --full
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Linked Clone&lt;/h3&gt;
&lt;p&gt;Shares base disk with template. Uses less space but depends on template.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;qm clone 9000 101 --name &quot;test-server&quot; --full 0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Warning&lt;/strong&gt;: If you delete the template, linked clones break. Use full clones for production.&lt;/p&gt;
&lt;h3&gt;Post-Clone Configuration&lt;/h3&gt;
&lt;p&gt;After cloning, customize via cloud-init:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Set hostname (cloud-init will apply on boot)
qm set 100 --name &quot;web-server&quot;

# Set static IP
qm set 100 --ipconfig0 ip=10.0.0.100/24,gw=10.0.0.1

# Start VM
qm start 100
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Cloud-init runs on first boot, setting hostname, network, and SSH keys.&lt;/p&gt;
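&lt;p&gt;The clone-and-configure steps above are easy to script for a batch of VMs. A dry-run sketch that only builds the command lines, without running them (the template ID, naming scheme, and subnet are illustrative assumptions):&lt;/p&gt;

```python
# Sketch: build the qm command lines for cloning N web servers from a template.
# Dry-run only -- prints commands instead of executing them.

def clone_commands(template_id, count, base_vmid=100, subnet="10.0.0", gw="10.0.0.1"):
    cmds = []
    for i in range(count):
        vmid = base_vmid + i
        host_ip = f"{subnet}.{vmid}"  # VMID doubles as the host octet here
        cmds.append(f"qm clone {template_id} {vmid} --name web-{i + 1} --full")
        cmds.append(f"qm set {vmid} --ipconfig0 ip={host_ip}/24,gw={gw}")
        cmds.append(f"qm start {vmid}")
    return cmds

for cmd in clone_commands(9000, 3):
    print(cmd)
```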
&lt;h2&gt;Template Versioning&lt;/h2&gt;
&lt;p&gt;Templates evolve. Kernel updates, package changes, security patches. Track versions:&lt;/p&gt;
&lt;h3&gt;Naming Convention&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;ubuntu-2404-v1       # Initial release
ubuntu-2404-v2       # Security update
ubuntu-2404-v3       # Added monitoring agent
&lt;/code&gt;&lt;/pre&gt;
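&lt;p&gt;With a consistent &lt;code&gt;-vN&lt;/code&gt; suffix, tooling can pick the newest version automatically instead of relying on memory. A sketch assuming that suffix convention:&lt;/p&gt;

```python
# Sketch: pick the newest template version from names like "ubuntu-2404-v3".
# Assumes the "-vN" naming convention shown above.
import re

def newest_template(names):
    versioned = []
    for name in names:
        m = re.fullmatch(r"(.+)-v(\d+)", name)
        if m:
            versioned.append((int(m.group(2)), name))
    return max(versioned)[1] if versioned else None

print(newest_template(["ubuntu-2404-v1", "ubuntu-2404-v3", "ubuntu-2404-v2"]))
# ubuntu-2404-v3
```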
&lt;h3&gt;Version in Description&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;qm set 9000 --description &quot;Ubuntu 24.04 LTS
Version: 3
Date: 2025-01-08
Changes:
- Added node_exporter
- Updated to kernel 6.8
- Fixed cloud-init network&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Golden Image Process&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;1. Monthly: Create a new template from a fresh image
2. Per update: Clone the template, patch the clone, convert the clone to the next version
   (a template cannot be converted back into a regular VM)
3. Document: What changed, why, who approved

Template lifecycle:
  new-template → testing → production → deprecated → deleted
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Multiple Templates&lt;/h2&gt;
&lt;p&gt;Different workloads need different templates:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;9000: ubuntu-2404-minimal     # Base, SSH only
9001: ubuntu-2404-web         # + nginx, certbot
9002: ubuntu-2404-docker      # + docker, compose
9003: debian-12-minimal       # Different OS
9004: almalinux-9-minimal     # RHEL-compatible
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Build specialized templates from minimal:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Clone minimal as base for web template
qm clone 9000 9100 --name &quot;ubuntu-2404-web-prep&quot; --full
qm start 9100

# SSH in, install web packages
ssh admin@&amp;lt;ip&amp;gt;
sudo apt install -y nginx certbot python3-certbot-nginx
# ... configure ...
sudo cloud-init clean
sudo shutdown -h now

# Convert to template
qm template 9100
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Troubleshooting Cloud-Init&lt;/h2&gt;
&lt;h3&gt;Cloud-Init Not Running&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check if cloud-init ran
cloud-init status

# View logs
cat /var/log/cloud-init.log
cat /var/log/cloud-init-output.log

# Re-run cloud-init from a clean state
cloud-init clean
reboot  # all stages run on next boot; &apos;cloud-init init&apos; alone re-runs only the init stage
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Network Not Configured&lt;/h3&gt;
&lt;p&gt;Check Proxmox cloud-init settings:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# View current cloud-init config
qm cloudinit dump 100 user
qm cloudinit dump 100 network
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Inside VM:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check cloud-init network config
cat /etc/netplan/*.yaml  # Ubuntu
cat /etc/sysconfig/network-scripts/*  # RHEL
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;SSH Key Not Working&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Verify key was injected
cat /home/admin/.ssh/authorized_keys

# Check cloud-init log for errors
grep -i ssh /var/log/cloud-init.log
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Hostname Not Set&lt;/h3&gt;
&lt;p&gt;Cloud-init sets hostname early. If it&apos;s still &quot;localhost&quot;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check cloud-init status
cloud-init status --long

# Force hostname update
hostnamectl set-hostname web-server
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Automation with Templates&lt;/h2&gt;
&lt;p&gt;Combine templates with automation:&lt;/p&gt;
&lt;h3&gt;Terraform&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;resource &quot;proxmox_vm_qemu&quot; &quot;web_servers&quot; {
  count       = 3
  name        = &quot;web-${count.index + 1}&quot;
  target_node = &quot;pve1&quot;
  clone       = &quot;ubuntu-2404-minimal&quot;
  full_clone  = true

  cores   = 2
  memory  = 4096

  network {
    model  = &quot;virtio&quot;
    bridge = &quot;vmbr0&quot;
    tag    = 20
  }

  ipconfig0 = &quot;ip=10.20.0.${count.index + 10}/24,gw=10.20.0.1&quot;

  ciuser  = &quot;admin&quot;
  sshkeys = file(&quot;~/.ssh/id_ed25519.pub&quot;)
}
&lt;/code&gt;&lt;/pre&gt;
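&lt;p&gt;The &lt;code&gt;count.index&lt;/code&gt; interpolation above hands out sequential host addresses. A quick sketch that reproduces the same addressing with the standard library and sanity-checks each address against the subnet (values mirror the example):&lt;/p&gt;

```python
# Sketch: reproduce the Terraform count-based addressing and verify that every
# generated address falls inside the intended subnet.
import ipaddress

subnet = ipaddress.ip_network("10.20.0.0/24")

def ipconfigs(count, offset=10):
    out = []
    for i in range(count):
        ip = ipaddress.ip_address(f"10.20.0.{i + offset}")
        assert ip in subnet, f"{ip} is outside {subnet}"
        out.append(f"ip={ip}/24,gw=10.20.0.1")
    return out

print(ipconfigs(3))
```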
&lt;h3&gt;Ansible&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;- name: Create VM from template
  community.general.proxmox_kvm:
    api_host: pve1.lab.local
    api_user: admin@pam
    api_token_id: ansible
    api_token_secret: &quot;{{ vault_proxmox_token }}&quot;
    node: pve1
    name: &quot;web-server&quot;
    clone: &quot;ubuntu-2404-minimal&quot;
    full: yes
    ciuser: admin
    sshkeys: &quot;{{ lookup(&apos;file&apos;, &apos;~/.ssh/id_ed25519.pub&apos;) }}&quot;
    ipconfig:
      ipconfig0: &quot;ip=10.0.0.100/24,gw=10.0.0.1&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;A template is a contract. If it floats, everything downstream breaks.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The template defines what every cloned VM starts with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Installed packages&lt;/li&gt;
&lt;li&gt;Security configuration&lt;/li&gt;
&lt;li&gt;User accounts&lt;/li&gt;
&lt;li&gt;Base services&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When you change the template, new VMs get the change. Existing VMs don&apos;t — they&apos;re already deployed. This creates drift.&lt;/p&gt;
&lt;p&gt;Treat templates like production artifacts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Version them (ubuntu-2404-v3, not just ubuntu-2404)&lt;/li&gt;
&lt;li&gt;Document changes (what, when, why)&lt;/li&gt;
&lt;li&gt;Test before promoting (clone, verify, then use for production)&lt;/li&gt;
&lt;li&gt;Retire old versions (don&apos;t let 5 versions of &quot;ubuntu template&quot; accumulate)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A stable template means predictable deployments. An unstable template means debugging why &quot;this VM is different&quot; every time something breaks.&lt;/p&gt;
</content:encoded><category>proxmox</category><category>automation</category><category>virtualization</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Networking Baseline: Bridges, VLANs, Bonding — and the Mistakes I Made</title><link>https://ashimov.com/posts/proxmox-networking/</link><guid isPermaLink="true">https://ashimov.com/posts/proxmox-networking/</guid><description>Proxmox networking fundamentals and common pitfalls. Covers Linux bridges, VLAN configuration, bonding modes, network isolation, and why 99% of virtualization network problems are inconsistent Layer 2.</description><pubDate>Tue, 12 Aug 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Networking in Proxmox breaks more setups than storage and compute combined. Not because it&apos;s complicated — it&apos;s actually simpler than most people expect. It breaks because virtualization networking requires consistency at Layer 2, and inconsistency is invisible until nothing works.&lt;/p&gt;
&lt;p&gt;I&apos;ve debugged countless &quot;my VM has no network&quot; issues. 99% were: wrong VLAN tag, wrong bridge, or physical switch misconfiguration. This is how to get networking right from the start.&lt;/p&gt;
&lt;h2&gt;The Default Setup&lt;/h2&gt;
&lt;p&gt;After a clean Proxmox install, you have:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Physical NIC (eno1/eth0)
    └── vmbr0 (Linux bridge)
            ├── Proxmox host (management IP)
            └── VMs connect here
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is correct and works. Don&apos;t overcomplicate it until you need to.&lt;/p&gt;
&lt;p&gt;Check your current config:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cat /etc/network/interfaces
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Default looks like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;auto lo
iface lo inet loopback

auto eno1
iface eno1 inet manual

auto vmbr0
iface vmbr0 inet static
    address 10.0.0.10/24
    gateway 10.0.0.1
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Key points:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;eno1&lt;/code&gt; has no IP (manual) — it&apos;s just a bridge port&lt;/li&gt;
&lt;li&gt;&lt;code&gt;vmbr0&lt;/code&gt; has the IP — this is your management address&lt;/li&gt;
&lt;li&gt;VMs attach to &lt;code&gt;vmbr0&lt;/code&gt; and share the same network&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Linux Bridges Explained&lt;/h2&gt;
&lt;p&gt;A bridge is a virtual switch. Physical NICs and virtual NICs connect to it.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;                    ┌─────────────────────────────┐
   Physical Network │        vmbr0 (bridge)       │ Virtual Network
   ─────────────────│                             │──────────────────
        eno1 ───────│ port                   port │─── VM1 (tap100i0)
                    │                        port │─── VM2 (tap101i0)
                    │                        port │─── Proxmox host
                    └─────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;All devices on the bridge see each other at Layer 2. Same broadcast domain, same VLAN (unless you add tagging).&lt;/p&gt;
&lt;h3&gt;Creating Additional Bridges&lt;/h3&gt;
&lt;p&gt;For network isolation, create multiple bridges:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Edit network config
nano /etc/network/interfaces
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Add:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Management network (existing)
auto vmbr0
iface vmbr0 inet static
    address 10.0.0.10/24
    gateway 10.0.0.1
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0

# DMZ network (new bridge, second NIC)
auto vmbr1
iface vmbr1 inet manual
    bridge-ports eno2
    bridge-stp off
    bridge-fd 0

# Internal-only network (no physical port)
auto vmbr2
iface vmbr2 inet manual
    bridge-ports none
    bridge-stp off
    bridge-fd 0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Apply:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ifreload -a
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now you have:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;vmbr0&lt;/code&gt;: Management + production VMs&lt;/li&gt;
&lt;li&gt;&lt;code&gt;vmbr1&lt;/code&gt;: DMZ VMs (different physical NIC, isolated)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;vmbr2&lt;/code&gt;: Internal-only (VMs can talk to each other, no outside access)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;VLANs&lt;/h2&gt;
&lt;p&gt;VLANs separate traffic on the same physical network. Essential when you have one physical NIC but need multiple isolated networks.&lt;/p&gt;
&lt;h3&gt;VLAN-Aware Bridge (Recommended)&lt;/h3&gt;
&lt;p&gt;Modern approach. One bridge handles multiple VLANs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;auto vmbr0
iface vmbr0 inet static
    address 10.0.0.10/24
    gateway 10.0.0.1
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
    bridge-pvid 1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Key settings:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;bridge-vlan-aware yes&lt;/code&gt;: Enable VLAN tagging&lt;/li&gt;
&lt;li&gt;&lt;code&gt;bridge-vids 2-4094&lt;/code&gt;: Allow these VLANs&lt;/li&gt;
&lt;li&gt;&lt;code&gt;bridge-pvid 1&lt;/code&gt;: Native VLAN (untagged traffic)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;VMs specify their VLAN when connecting:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# In VM config (/etc/pve/qemu-server/100.conf)
net0: virtio,bridge=vmbr0,tag=100
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or via Web UI: VM → Hardware → Network → VLAN Tag: 100&lt;/p&gt;
&lt;h3&gt;Traditional VLAN Interfaces (Older Method)&lt;/h3&gt;
&lt;p&gt;Create a sub-interface for each VLAN:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;auto eno1.100
iface eno1.100 inet manual

auto vmbr100
iface vmbr100 inet manual
    bridge-ports eno1.100
    bridge-stp off
    bridge-fd 0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This works but creates more interfaces. VLAN-aware bridges are cleaner.&lt;/p&gt;
&lt;h3&gt;Common VLAN Mistakes&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;1. Physical switch not configured for VLANs&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Your Proxmox config is perfect, but the switch port is access-mode VLAN 1. Nothing works.&lt;/p&gt;
&lt;p&gt;Fix: Configure switch port as trunk allowing your VLANs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. VLAN tag mismatch&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;VM is tagged VLAN 100, but there&apos;s no VLAN 100 on the switch.&lt;/p&gt;
&lt;p&gt;Fix: Verify VLANs exist end-to-end: switch, router, Proxmox.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;3. Native VLAN confusion&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Management traffic is untagged (PVID), VM traffic is tagged. If PVID doesn&apos;t match switch native VLAN, management breaks.&lt;/p&gt;
&lt;p&gt;Fix: Be explicit about native VLAN on both sides.&lt;/p&gt;
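&lt;p&gt;All three mistakes are mechanical to catch once you list what each layer allows. A sketch of such a check (the inputs are placeholders you would fill in from your switch, bridge, and VM configs):&lt;/p&gt;

```python
# Sketch: flag VM VLAN tags that are not allowed somewhere along the path
# (switch trunk or bridge vids). Inputs are illustrative placeholders.

def vlan_mismatches(switch_trunk_vlans, bridge_vids, vm_tags):
    problems = {}
    for vm, tag in vm_tags.items():
        missing = []
        if tag not in switch_trunk_vlans:
            missing.append("switch trunk")
        if tag not in bridge_vids:
            missing.append("bridge vids")
        if missing:
            problems[vm] = (tag, missing)
    return problems

issues = vlan_mismatches(
    switch_trunk_vlans={10, 20, 30},
    bridge_vids=set(range(2, 4095)),   # bridge-vids 2-4094
    vm_tags={"web": 20, "dmz": 30, "storage": 100},
)
print(issues)  # the storage VLAN (100) is not trunked on the switch
```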
&lt;h2&gt;Bonding (Link Aggregation)&lt;/h2&gt;
&lt;p&gt;Multiple NICs acting as one for redundancy or throughput.&lt;/p&gt;
&lt;h3&gt;Bonding Modes&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;balance-rr&lt;/td&gt;
&lt;td&gt;Round-robin, requires switch support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;active-backup&lt;/td&gt;
&lt;td&gt;Failover, no switch config needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;balance-xor&lt;/td&gt;
&lt;td&gt;XOR hash, requires switch support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;broadcast&lt;/td&gt;
&lt;td&gt;Send on all, niche uses&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;802.3ad (LACP)&lt;/td&gt;
&lt;td&gt;Dynamic aggregation, requires switch LACP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;balance-tlb&lt;/td&gt;
&lt;td&gt;Adaptive transmit, no switch config&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;balance-alb&lt;/td&gt;
&lt;td&gt;Adaptive load balancing, no switch config&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Recommended:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;mode 1 (active-backup)&lt;/strong&gt;: Simplest, works everywhere, true redundancy&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;mode 4 (LACP)&lt;/strong&gt;: Best throughput, but requires switch configuration&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Active-Backup Bond (Easy)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;auto bond0
iface bond0 inet manual
    bond-slaves eno1 eno2
    bond-miimon 100
    bond-mode active-backup
    bond-primary eno1

auto vmbr0
iface vmbr0 inet static
    address 10.0.0.10/24
    gateway 10.0.0.1
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Behavior: eno1 is active, eno2 is standby. If eno1 fails, eno2 takes over within roughly 100ms (the &lt;code&gt;bond-miimon&lt;/code&gt; polling interval).&lt;/p&gt;
&lt;h3&gt;LACP Bond (Best Performance)&lt;/h3&gt;
&lt;p&gt;Requires switch LACP configuration first:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;auto bond0
iface bond0 inet manual
    bond-slaves eno1 eno2
    bond-miimon 100
    bond-mode 802.3ad
    bond-lacp-rate fast
    bond-xmit-hash-policy layer3+4

auto vmbr0
iface vmbr0 inet static
    address 10.0.0.10/24
    gateway 10.0.0.1
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Check bond status:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cat /proc/net/bonding/bond0
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Bonding Gotchas&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Single flow doesn&apos;t benefit&lt;/strong&gt;: A single TCP connection uses one link. Bonding helps aggregate throughput, not single-connection speed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Switch must match&lt;/strong&gt;: LACP requires switch-side configuration. Mismatched settings = no connectivity.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;MII monitoring&lt;/strong&gt;: 100ms (&lt;code&gt;bond-miimon 100&lt;/code&gt;) is standard. Lower = faster failover but more CPU.&lt;/p&gt;
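&lt;p&gt;The single-flow limitation follows directly from how transmit hashing works: the flow&apos;s addresses and ports are hashed to select one slave, so every packet of that flow uses the same link. A toy illustration (the real kernel hash differs):&lt;/p&gt;

```python
# Toy sketch of the layer3+4 transmit-hash idea: a flow tuple hashes to one
# slave index, so a single flow is capped at one link's speed.

def pick_link(src_ip, dst_ip, src_port, dst_port, n_links=2):
    flow = (src_ip, dst_ip, src_port, dst_port)
    return hash(flow) % n_links

# Same flow, same link -- every time within a run.
link = pick_link("10.0.0.5", "10.0.0.9", 40000, 443)
assert all(pick_link("10.0.0.5", "10.0.0.9", 40000, 443) == link for _ in range(100))
```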
&lt;h2&gt;Network Isolation&lt;/h2&gt;
&lt;p&gt;Keeping networks separate is as important as connecting them.&lt;/p&gt;
&lt;h3&gt;Isolated Internal Network&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;auto vmbr2
iface vmbr2 inet manual
    bridge-ports none
    bridge-stp off
    bridge-fd 0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;VMs on vmbr2 can only talk to each other. No physical network, no internet.&lt;/p&gt;
&lt;p&gt;Use cases:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Database servers that only backends should reach&lt;/li&gt;
&lt;li&gt;Development environments&lt;/li&gt;
&lt;li&gt;Testing isolated from production&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;VMs Accessing Multiple Networks&lt;/h3&gt;
&lt;p&gt;A VM can have multiple NICs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# VM config
net0: virtio,bridge=vmbr0,tag=10      # Production VLAN
net1: virtio,bridge=vmbr2             # Internal only
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Inside VM, configure each interface appropriately.&lt;/p&gt;
&lt;h2&gt;Proxmox Host Networking&lt;/h2&gt;
&lt;p&gt;The host itself needs network access for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Management (web UI, SSH)&lt;/li&gt;
&lt;li&gt;Corosync (clustering)&lt;/li&gt;
&lt;li&gt;Storage (NFS, Ceph, iSCSI)&lt;/li&gt;
&lt;li&gt;Updates&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Management Network&lt;/h3&gt;
&lt;p&gt;Always use static IP for management:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;auto vmbr0
iface vmbr0 inet static
    address 10.0.0.10/24
    gateway 10.0.0.1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Never DHCP for a hypervisor. You need to know where it is.&lt;/p&gt;
&lt;h3&gt;Separate Storage Network (If Needed)&lt;/h3&gt;
&lt;p&gt;For high-performance storage (Ceph, iSCSI), dedicate a network:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;auto vmbr1
iface vmbr1 inet static
    address 10.10.0.10/24
    bridge-ports eno2
    bridge-stp off
    bridge-fd 0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Configure storage to use this network, keeping management traffic separate.&lt;/p&gt;
&lt;h2&gt;Troubleshooting&lt;/h2&gt;
&lt;h3&gt;VM Has No Network&lt;/h3&gt;
&lt;p&gt;Check in order:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# 1. Is the bridge up?
ip link show vmbr0

# 2. Is the physical port up?
ip link show eno1

# 3. Is the VM&apos;s tap interface in the bridge?
bridge link show

# 4. Inside VM, is the interface up?
# (via console, not SSH since network is broken)
ip a

# 5. Can VM ping gateway?
ping 10.0.0.1

# 6. Check for VLAN issues
tcpdump -i vmbr0 -n icmp
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Traffic Not Reaching VM&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check bridge forwarding
sysctl net.bridge.bridge-nf-call-iptables
# Should be 0, or iptables might interfere

# Check VM&apos;s interface is in bridge
bridge link show master vmbr0
# Should list tap100i0, tap101i0, etc.

# Capture on bridge
tcpdump -i vmbr0 host 10.0.0.100 -n
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;VLAN Traffic Not Working&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check VLAN-aware bridge
bridge vlan show

# Should show VLANs per port:
# vmbr0    1 PVID Egress Untagged
# eno1     1 PVID Egress Untagged
#          100
#          200
# tap100i0 100 PVID Egress Untagged

# If VMs VLAN not listed, check VM config
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;My Network Layout&lt;/h2&gt;
&lt;h3&gt;Simple Homelab&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;ISP Router (10.0.0.1)
     │
     └── eno1
          │
     ┌────┴────┐
     │  vmbr0  │ 10.0.0.10 (Proxmox)
     │ (bridge)│
     ├─────────┤
     │ VM1     │ 10.0.0.101 (untagged)
     │ VM2     │ 10.0.0.102 (untagged)
     └─────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Flat network. Simple. Everything on same subnet.&lt;/p&gt;
&lt;h3&gt;Production with VLANs&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Core Switch (trunk port)
     │ VLANs: 10, 20, 30, 100
     │
     └── eno1
          │
     ┌────┴────────────────────┐
     │        vmbr0            │ VLAN-aware
     │ (PVID 10 = management)  │
     ├─────────────────────────┤
     │ tag=10: Management VMs  │
     │ tag=20: Production VMs  │
     │ tag=30: DMZ             │
     │ tag=100: Storage        │
     └─────────────────────────┘

Proxmox management: 10.10.0.10/24 (VLAN 10, untagged on bridge)
Production VMs:     10.20.0.0/24 (VLAN 20)
DMZ VMs:           10.30.0.0/24 (VLAN 30)
Storage network:    10.100.0.0/24 (VLAN 100)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;One physical NIC, multiple isolated networks.&lt;/p&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;99% of virtualization network problems are inconsistent Layer 2.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The config looks right. Proxmox is configured. VMs have IPs. But nothing works. Why?&lt;/p&gt;
&lt;p&gt;Because somewhere in the chain — switch port, VLAN configuration, bridge settings, VM tag — something doesn&apos;t match.&lt;/p&gt;
&lt;p&gt;Virtualization networking requires consistency:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Switch port must trunk the right VLANs&lt;/li&gt;
&lt;li&gt;Bridge must be VLAN-aware if you&apos;re tagging&lt;/li&gt;
&lt;li&gt;VM must use the correct tag&lt;/li&gt;
&lt;li&gt;Physical network must route between VLANs (if needed)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When it breaks:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Start at the physical layer (is the cable plugged in?)&lt;/li&gt;
&lt;li&gt;Check switch configuration (is VLAN allowed?)&lt;/li&gt;
&lt;li&gt;Check bridge configuration (is VLAN-aware enabled?)&lt;/li&gt;
&lt;li&gt;Check VM configuration (is tag correct?)&lt;/li&gt;
&lt;li&gt;Check inside VM (is interface up?)&lt;/li&gt;
&lt;/ol&gt;
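&lt;p&gt;Steps 2 through 4 all reduce to the same question: is this VM&apos;s tag in the trunk&apos;s allow-list? The comparison is trivial to script; a sketch with a made-up helper name:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Hypothetical helper: first argument is the VM tag,
# the rest is the list of VLANs allowed on the trunk
check_vlan() {
  tag=$1; shift
  for v in $*; do
    if [ $v = $tag ]; then
      echo OK: VLAN $tag is trunked
      return 0
    fi
  done
  echo MISMATCH: VLAN $tag is not on the trunk
  return 1
}

check_vlan 20 10 20 30 100   # prints: OK: VLAN 20 is trunked
check_vlan 40 10 20 30 100   # prints: MISMATCH: VLAN 40 is not on the trunk
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;On a real host the inputs come from &lt;code&gt;bridge vlan show&lt;/code&gt; on the Proxmox side and the trunk configuration on the switch side.&lt;/p&gt;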
&lt;p&gt;The fix is almost always a mismatch somewhere. Find it, fix it, document it so you don&apos;t repeat it.&lt;/p&gt;
</content:encoded><category>proxmox</category><category>networking</category><category>virtualization</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Storage 101: Local, ZFS, LVM-thin — What I Actually Use and Why</title><link>https://ashimov.com/posts/proxmox-storage/</link><guid isPermaLink="true">https://ashimov.com/posts/proxmox-storage/</guid><description>Practical guide to Proxmox storage options. Covers local directory, LVM-thin, ZFS pools, when to use each, snapshot limitations, and why fast storage is often fragile storage.</description><pubDate>Fri, 08 Aug 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Storage decisions in Proxmox affect everything downstream. Choose wrong and you&apos;re either rebuilding later or living with limitations. Choose right and you forget storage exists — it just works.&lt;/p&gt;
&lt;p&gt;The problem is that &quot;right&quot; depends on your use case. ZFS is amazing until your 8GB RAM server starts swapping. LVM-thin is fast until you need to migrate VMs. Directory storage is simple until you want snapshots.&lt;/p&gt;
&lt;p&gt;This is what I actually use and why, after trying all of them.&lt;/p&gt;
&lt;h2&gt;Storage Types in Proxmox&lt;/h2&gt;
&lt;p&gt;Proxmox supports multiple storage backends. Each has trade-offs:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Snapshots&lt;/th&gt;
&lt;th&gt;Live Backup&lt;/th&gt;
&lt;th&gt;Thin Provisioning&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Directory&lt;/td&gt;
&lt;td&gt;No*&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Simple, on any filesystem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LVM&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Block storage, no snapshots&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LVM-thin&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Block storage with thin volumes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ZFS&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Best features, needs RAM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ceph&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Distributed, complex&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NFS/CIFS&lt;/td&gt;
&lt;td&gt;No*&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Depends&lt;/td&gt;
&lt;td&gt;Network storage&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;*Can use qcow2 format for snapshots, but slower&lt;/p&gt;
&lt;h2&gt;What Gets Stored Where&lt;/h2&gt;
&lt;p&gt;Before diving into backends, understand what you&apos;re storing:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;VM Disks&lt;/strong&gt;: The actual virtual hard drives. Performance critical.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ISO Images&lt;/strong&gt;: Installation media. Read-once, performance doesn&apos;t matter.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Container Templates&lt;/strong&gt;: LXC images. Small, read occasionally.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Backups&lt;/strong&gt;: Compressed VM snapshots. Large, written sequentially.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Snippets&lt;/strong&gt;: Cloud-init configs, hook scripts. Tiny files.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Not everything needs fast storage. Putting ISOs on your NVMe ZFS pool wastes space.&lt;/p&gt;
&lt;h2&gt;Directory Storage&lt;/h2&gt;
&lt;p&gt;The simplest option. Just a folder on a filesystem.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Default directories after install
/var/lib/vz/template/iso      # ISO images
/var/lib/vz/template/cache    # Container templates
/var/lib/vz/dump              # Backups
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;When to Use Directory Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;ISO images (read once during install)&lt;/li&gt;
&lt;li&gt;Container templates (read once during creation)&lt;/li&gt;
&lt;li&gt;Backups (sequential writes, then archive)&lt;/li&gt;
&lt;li&gt;Small deployments where simplicity matters&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Directory Limitations&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;No atomic snapshots (unless using qcow2 format, which is slower)&lt;/li&gt;
&lt;li&gt;No thin provisioning — disk images use actual space&lt;/li&gt;
&lt;li&gt;Performance depends entirely on underlying filesystem&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Adding Directory Storage&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Create directory
mkdir -p /mnt/storage/proxmox

# Add to Proxmox
pvesm add dir backup-storage --path /mnt/storage/proxmox --content backup,iso,vztmpl
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or via Web UI: Datacenter → Storage → Add → Directory&lt;/p&gt;
&lt;h2&gt;LVM-thin&lt;/h2&gt;
&lt;p&gt;LVM with thin provisioning. You allocate a pool, then create thin volumes that share space.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Physical Disk (500GB)
└── Volume Group: pve
    └── Thin Pool: data (400GB allocated)
        ├── VM 100 disk (100GB virtual, 20GB actual)
        ├── VM 101 disk (100GB virtual, 35GB actual)
        └── VM 102 disk (100GB virtual, 15GB actual)
        → Total actual usage: 70GB in 400GB pool
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;LVM-thin Advantages&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Thin provisioning&lt;/strong&gt;: Allocate more than you have, pay for what you use&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Snapshots&lt;/strong&gt;: LVM snapshots work (with caveats)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Speed&lt;/strong&gt;: Direct block access, no filesystem overhead&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Low memory&lt;/strong&gt;: No significant RAM overhead&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;LVM-thin Disadvantages&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;No checksums&lt;/strong&gt;: Data corruption is silent&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Snapshot overhead&lt;/strong&gt;: Snapshots slow down writes&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pool can fill&lt;/strong&gt;: Over-provisioning requires monitoring&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Migration complexity&lt;/strong&gt;: Moving thin volumes isn&apos;t trivial&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Default LVM-thin Setup&lt;/h3&gt;
&lt;p&gt;Proxmox installer creates this automatically:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check LVM-thin pool
lvs
# NAME    VG  Attr       LSize
# data    pve twi-aotz-- 400g

# Check thin pool usage
lvs -o+data_percent
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;When LVM-thin Pool Fills&lt;/h3&gt;
&lt;p&gt;This is the danger zone. If your thin pool hits 100%, VMs pause or corrupt.&lt;/p&gt;
&lt;p&gt;Monitor it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check usage
lvs -o name,size,data_percent pve/data

# Set up alert (add to cron)
USAGE=$(lvs --noheadings -o data_percent pve/data | tr -d &apos; %&apos;)
USAGE=${USAGE%%.*}   # data_percent reports decimals; integer compare needs them gone
if [ &quot;$USAGE&quot; -gt 80 ]; then
    echo &quot;LVM thin pool at ${USAGE}%&quot; | mail -s &quot;Storage Alert&quot; admin@example.com
fi
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;ZFS&lt;/h2&gt;
&lt;p&gt;My preferred choice for most deployments. ZFS is a filesystem and volume manager combined.&lt;/p&gt;
&lt;h3&gt;ZFS Advantages&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Checksums&lt;/strong&gt;: Every block is verified, silent corruption is detected&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Snapshots&lt;/strong&gt;: Instant, cheap, no performance penalty during creation&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compression&lt;/strong&gt;: lz4 compression is basically free (often faster than uncompressed!)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Send/Receive&lt;/strong&gt;: Efficient replication to another system&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Self-healing&lt;/strong&gt;: With redundancy, bad blocks are automatically repaired&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;ZFS Disadvantages&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;RAM hungry&lt;/strong&gt;: Wants 1GB+ per TB of storage for optimal ARC&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CPU for compression&lt;/strong&gt;: Minimal with lz4, noticeable with zstd&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Complexity&lt;/strong&gt;: More knobs to understand&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No shrink&lt;/strong&gt;: Can&apos;t reduce pool size&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;ZFS Pool Status&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Pool health
zpool status
#   pool: rpool
#   state: ONLINE
#   config:
#     NAME         STATE     READ WRITE CKSUM
#     rpool        ONLINE       0     0     0
#       nvme0n1p3  ONLINE       0     0     0

# Space usage
zfs list
# NAME                     USED  AVAIL  REFER  MOUNTPOINT
# rpool                    120G   280G    96K  /rpool
# rpool/ROOT               50G    280G    96K  /rpool/ROOT
# rpool/data               70G    280G    96K  /rpool/data

# Check compression ratio
zfs get compressratio rpool/data
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Tuning ZFS for Proxmox&lt;/h3&gt;
&lt;p&gt;Limit ARC to leave RAM for VMs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check current ARC size
arc_summary | grep &quot;ARC size&quot;

# Limit to 4GB (adjust based on your RAM)
echo &quot;options zfs zfs_arc_max=4294967296&quot; &amp;gt; /etc/modprobe.d/zfs.conf
update-initramfs -u
reboot
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Rule of thumb: Give ARC 1GB per TB of storage, minimum 1GB, maximum 50% of RAM.&lt;/p&gt;
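&lt;p&gt;The rule of thumb is easy to express as arithmetic. A sketch with a made-up function name, working in whole GiB:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# 1 GiB of ARC per TiB of storage, minimum 1 GiB, capped at half of RAM
arc_gib() {
  storage_tb=$1
  ram_gb=$2
  want=$storage_tb
  half=$((ram_gb / 2))
  if [ $want -lt 1 ]; then want=1; fi
  if [ $want -gt $half ]; then want=$half; fi
  echo $want
}

arc_gib 4 16                       # prints 4
echo $((4 * 1024 * 1024 * 1024))   # 4294967296, the zfs_arc_max value for 4 GiB
&lt;/code&gt;&lt;/pre&gt;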
&lt;h3&gt;Adding ZFS Storage&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Create new pool on separate disk
zpool create -o ashift=12 tank /dev/sdb

# Enable compression
zfs set compression=lz4 tank

# Create dataset for VMs
zfs create tank/vms

# Add to Proxmox
pvesm add zfspool tank-vms --pool tank/vms --content images,rootdir
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;My Actual Setup&lt;/h2&gt;
&lt;h3&gt;Single Node Homelab&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;NVMe 500GB (rpool)
├── rpool/ROOT       # Proxmox OS (50GB)
└── rpool/data       # VM disks (ZFS, compression, snapshots)

SATA SSD 1TB (tank)
└── tank/backups     # Backups (directory storage)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Why:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;ZFS for VM disks — I want snapshots and checksums&lt;/li&gt;
&lt;li&gt;Separate disk for backups — if rpool dies, backups survive&lt;/li&gt;
&lt;li&gt;Compression saves 20-40% space on typical workloads&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Production Cluster&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;NVMe 256GB (rpool)
└── rpool/ROOT       # OS only, small and fast

2x SATA SSD 1TB (mirror, vmpool)
└── vmpool/data      # VM disks with redundancy

2x HDD 4TB (mirror, backup)
└── backup/proxmox   # Proxmox Backup Server storage
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Why:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Separate OS from VMs — OS disk failure doesn&apos;t lose VMs&lt;/li&gt;
&lt;li&gt;Mirrors for redundancy — single disk failure = no downtime&lt;/li&gt;
&lt;li&gt;HDDs for backups — capacity over speed, write once read rarely&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Snapshots vs Backups&lt;/h2&gt;
&lt;p&gt;This is where people get confused.&lt;/p&gt;
&lt;h3&gt;Snapshots Are Not Backups&lt;/h3&gt;
&lt;p&gt;A snapshot is a point-in-time view stored on the same disk:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create ZFS snapshot
zfs snapshot rpool/data/vm-100-disk-0@before-upgrade

# List snapshots
zfs list -t snapshot

# Rollback (needs -r if newer snapshots exist; they get destroyed)
zfs rollback rpool/data/vm-100-disk-0@before-upgrade
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Snapshots are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Instant&lt;/strong&gt;: No performance penalty to create&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Same disk&lt;/strong&gt;: If disk dies, snapshots die too&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;For rollback&lt;/strong&gt;: Made a bad change? Roll back in seconds&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Snapshots are NOT:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Off-site&lt;/strong&gt;: They&apos;re on the same physical disk&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Disaster recovery&lt;/strong&gt;: Disk failure loses everything&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Long-term retention&lt;/strong&gt;: Too many snapshots = space + performance issues&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Backups Are Copies Elsewhere&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Proxmox backup (stores on backup storage)
vzdump 100 --storage backup-storage --mode snapshot

# ZFS send to another system
zfs send rpool/data/vm-100-disk-0@backup | ssh backup-server zfs recv tank/backups/vm-100
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Backups are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Off-system&lt;/strong&gt;: Different disk, different machine, different building&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Disaster recovery&lt;/strong&gt;: Original dies, restore from backup&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Long-term&lt;/strong&gt;: Keep 30 days, 12 weeks, whatever you need&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Use both.&lt;/strong&gt; Snapshots for quick rollbacks (before upgrades, config changes). Backups for disaster recovery.&lt;/p&gt;
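&lt;p&gt;For the before-upgrade habit, timestamped snapshot names keep rollback targets unambiguous. A small sketch; the helper name is made up:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Build a name like dataset@before-upgrade-20250808-0930
snapname() {
  echo $1@before-$2-$(date +%Y%m%d-%H%M)
}

# zfs snapshot $(snapname rpool/data/vm-100-disk-0 upgrade)
&lt;/code&gt;&lt;/pre&gt;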
&lt;h2&gt;Storage Performance&lt;/h2&gt;
&lt;h3&gt;Testing Your Storage&lt;/h3&gt;
&lt;p&gt;Before putting workloads on storage, benchmark it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Install fio
apt install fio

# Random 4K writes (database-like)
fio --name=rand-write --ioengine=libaio --iodepth=32 --rw=randwrite --bs=4k --direct=1 --size=1G --numjobs=4 --runtime=60 --group_reporting --filename=/rpool/data/test.fio

# Sequential writes (backup-like)
fio --name=seq-write --ioengine=libaio --iodepth=1 --rw=write --bs=1m --direct=1 --size=4G --numjobs=1 --runtime=60 --group_reporting --filename=/rpool/data/test.fio

# Cleanup
rm /rpool/data/test.fio
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Typical numbers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;NVMe: 500K+ IOPS random, 3GB/s+ sequential&lt;/li&gt;
&lt;li&gt;SATA SSD: 50K IOPS random, 500MB/s sequential&lt;/li&gt;
&lt;li&gt;HDD: 150 IOPS random, 150MB/s sequential&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Fast Often Means Fragile&lt;/h3&gt;
&lt;p&gt;High-performance storage often sacrifices safety:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;NVMe without power loss protection&lt;/strong&gt;: Data corruption on power loss&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Write caching without battery backup&lt;/strong&gt;: Same problem&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consumer SSDs&lt;/strong&gt;: Not designed for write-heavy workloads&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For VMs that matter, use enterprise SSDs with power loss protection or ZFS with a proper setup (mirrors, proper RAM, UPS).&lt;/p&gt;
&lt;h2&gt;Storage Migration&lt;/h2&gt;
&lt;p&gt;Need to move VMs between storage backends?&lt;/p&gt;
&lt;h3&gt;Online Migration (VM Running)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Move disk to different storage
qm move_disk 100 scsi0 target-storage

# Or via Web UI: VM → Hardware → Disk → Move Storage
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Offline Migration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Stop VM
qm stop 100

# Back up to a file, then restore it onto the target storage
# (qm has no export/import subcommands; vzdump + qmrestore is the supported path)
vzdump 100 --dumpdir /tmp --compress zstd
qmrestore /tmp/vzdump-qemu-100-*.vma.zst 100 --storage target-storage --force
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;ZFS Send/Receive&lt;/h3&gt;
&lt;p&gt;For ZFS-to-ZFS, this is most efficient:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Send to remote
zfs send rpool/data/vm-100-disk-0@migrate | ssh target-host zfs recv tank/data/vm-100-disk-0
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Snapshots are not backups. And fast often means fragile.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Storage is where data lives. Get it wrong and you lose everything. The temptation is to optimize for speed — NVMe everything, no redundancy, maximum performance.&lt;/p&gt;
&lt;p&gt;Then a disk fails. Or worse, corrupts silently. And you discover that your snapshots were on the same disk that died.&lt;/p&gt;
&lt;p&gt;My approach:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;ZFS for data that matters&lt;/strong&gt; — checksums catch corruption&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mirrors for production&lt;/strong&gt; — single disk failure = no panic&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Separate backup storage&lt;/strong&gt; — not on the same disk, not on the same host&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Test restores&lt;/strong&gt; — a backup you haven&apos;t restored is a backup you hope works&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The boring, redundant setup survives. The fast, minimal setup survives until it doesn&apos;t.&lt;/p&gt;
</content:encoded><category>proxmox</category><category>homelab</category><category>storage</category><category>virtualization</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Post-Install Baseline: Users, SSH, Firewall, Updates, and Hardening</title><link>https://ashimov.com/posts/proxmox-security/</link><guid isPermaLink="true">https://ashimov.com/posts/proxmox-security/</guid><description>Essential Proxmox security hardening after installation. Covers user management, SSH key-only access, host firewall configuration, automatic updates, and why security is easier to implement now than later.</description><pubDate>Tue, 05 Aug 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A fresh Proxmox install works. You can create VMs, manage storage, access the web UI. But &quot;works&quot; isn&apos;t &quot;secure.&quot; The default configuration prioritizes convenience over hardening. That&apos;s fine for the installer — you need access to finish setup. It&apos;s not fine for production.&lt;/p&gt;
&lt;p&gt;Security is easier to implement now, in the first hour, than &quot;someday later.&quot; Later never comes. And when it does, it&apos;s usually because something bad happened.&lt;/p&gt;
&lt;p&gt;This is the post-install hardening I do on every Proxmox host before it runs any workload.&lt;/p&gt;
&lt;h2&gt;User Management&lt;/h2&gt;
&lt;h3&gt;Stop Using Root for Everything&lt;/h3&gt;
&lt;p&gt;The web UI logs you in as root. SSH defaults to root. This works, but:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Root has no audit trail (who did what?)&lt;/li&gt;
&lt;li&gt;Root can destroy everything with one typo&lt;/li&gt;
&lt;li&gt;Shared root password = no accountability&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Create personal admin accounts:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create user with admin access
useradd -m -s /bin/bash admin
passwd admin

# Add to sudo group (a minimal Proxmox install ships without sudo)
apt install sudo
usermod -aG sudo admin
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now add this user to Proxmox:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create Proxmox user (realm = pam for Linux users)
pveum user add admin@pam

# Grant Administrator role
pveum acl modify / --users admin@pam --roles Administrator
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can now log into the web UI as &lt;code&gt;admin@pam&lt;/code&gt; instead of root.&lt;/p&gt;
&lt;h3&gt;Proxmox Authentication Realms&lt;/h3&gt;
&lt;p&gt;Proxmox has multiple authentication realms:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;pam&lt;/strong&gt;: Linux system users. Best for admins who also need SSH.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;pve&lt;/strong&gt;: Proxmox internal users. Web UI only, no SSH access.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LDAP/AD&lt;/strong&gt;: Enterprise directory integration.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For small deployments, PAM users are simplest. One password for SSH and web UI.&lt;/p&gt;
&lt;h3&gt;Two-Factor Authentication&lt;/h3&gt;
&lt;p&gt;Enable 2FA for all admin accounts:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Web UI → Datacenter → Permissions → Two Factor&lt;/li&gt;
&lt;li&gt;Add TOTP for your user&lt;/li&gt;
&lt;li&gt;Scan QR code with authenticator app&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This protects against password compromise. Even if someone gets your password, they need the TOTP code.&lt;/p&gt;
&lt;h2&gt;SSH Hardening&lt;/h2&gt;
&lt;p&gt;Default SSH is password authentication as root. Every botnet on the internet is scanning for this.&lt;/p&gt;
&lt;h3&gt;Generate SSH Keys&lt;/h3&gt;
&lt;p&gt;On your workstation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ssh-keygen -t ed25519 -C &quot;admin@proxmox&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Copy to the server:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ssh-copy-id admin@pve1.lab.local
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Harden sshd_config&lt;/h3&gt;
&lt;p&gt;Edit &lt;code&gt;/etc/ssh/sshd_config&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Disable root login
PermitRootLogin no

# Disable password authentication
PasswordAuthentication no

# Only allow specific users
AllowUsers admin

# Use only strong algorithms
KexAlgorithms curve25519-sha256@libssh.org,diffie-hellman-group16-sha512
Ciphers chacha20-poly1305@openssh.com,aes256-gcm@openssh.com
MACs hmac-sha2-512-etm@openssh.com,hmac-sha2-256-etm@openssh.com

# Reduce login grace time
LoginGraceTime 30

# Limit authentication attempts
MaxAuthTries 3

# Disable unused features
X11Forwarding no
AllowTcpForwarding no
AllowAgentForwarding no
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Apply changes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;systemctl restart sshd
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Test before disconnecting.&lt;/strong&gt; Open a new terminal, verify you can log in with your key. Don&apos;t lock yourself out.&lt;/p&gt;
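&lt;p&gt;&lt;code&gt;sshd -t&lt;/code&gt; validates the syntax before you restart. Beyond syntax, the two settings that matter most can be audited with grep; a sketch with made-up helper names:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Succeed if FILE ($2) sets DIRECTIVE ($1) to the word no
ok_no() {
  grep -i ^$1 $2 | grep -qiw no
}

audit_sshd() {
  cfg=$1
  bad=0
  if ! ok_no PermitRootLogin $cfg; then
    echo FAIL: root login still permitted
    bad=1
  fi
  if ! ok_no PasswordAuthentication $cfg; then
    echo FAIL: password auth still enabled
    bad=1
  fi
  if [ $bad -eq 0 ]; then echo PASS; fi
  return $bad
}

# audit_sshd /etc/ssh/sshd_config
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run it against &lt;code&gt;/etc/ssh/sshd_config&lt;/code&gt; after the restart; anything other than PASS means the hardening did not land.&lt;/p&gt;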
&lt;h3&gt;Fail2Ban (Optional but Recommended)&lt;/h3&gt;
&lt;p&gt;Even with key-only auth, bots still try. Fail2ban reduces log noise:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apt install fail2ban

# Create local config
cat &amp;gt; /etc/fail2ban/jail.local &amp;lt;&amp;lt; &apos;EOF&apos;
[sshd]
enabled = true
port = ssh
filter = sshd
logpath = /var/log/auth.log
maxretry = 3
bantime = 3600
findtime = 600
EOF

systemctl enable --now fail2ban
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Check banned IPs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;fail2ban-client status sshd
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Host Firewall&lt;/h2&gt;
&lt;p&gt;Proxmox VMs get their own firewall (we&apos;ll cover that later). But the host itself needs protection too.&lt;/p&gt;
&lt;h3&gt;Proxmox Built-in Firewall&lt;/h3&gt;
&lt;p&gt;Proxmox has a built-in firewall. Enable it for the datacenter and node:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Enable firewall at datacenter level
pvesh set /cluster/firewall/options --enable 1

# Enable for this node
pvesh set /nodes/pve1/firewall/options --enable 1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or via Web UI: Datacenter → Firewall → Options → Enable: Yes&lt;/p&gt;
&lt;h3&gt;Default Policies&lt;/h3&gt;
&lt;p&gt;Allow what you need first, then set the default to drop. In that order: flip the policy before the allow rules exist and you cut yourself off.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Allow SSH
pvesh create /cluster/firewall/rules --action ACCEPT --type in --enable 1 --dport 22 --proto tcp --comment &quot;SSH&quot;

# Allow Proxmox web UI
pvesh create /cluster/firewall/rules --action ACCEPT --type in --enable 1 --dport 8006 --proto tcp --comment &quot;Proxmox Web UI&quot;

# Allow ICMP (ping)
pvesh create /cluster/firewall/rules --action ACCEPT --type in --enable 1 --proto icmp --comment &quot;ICMP&quot;

# Established and related traffic is allowed automatically: the firewall is stateful

# Finally, set the default input policy to DROP
pvesh set /cluster/firewall/options --policy_in DROP
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Management Network Restriction&lt;/h3&gt;
&lt;p&gt;Better: restrict management access to specific subnets:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Only allow SSH from management network
pvesh create /cluster/firewall/rules --action ACCEPT --type in --enable 1 --dport 22 --proto tcp --source 10.0.0.0/24 --comment &quot;SSH from mgmt&quot;

# Only allow web UI from management network
pvesh create /cluster/firewall/rules --action ACCEPT --type in --enable 1 --dport 8006 --proto tcp --source 10.0.0.0/24 --comment &quot;Web UI from mgmt&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Cluster Communication&lt;/h3&gt;
&lt;p&gt;If you&apos;re clustering, allow inter-node traffic:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Corosync
pvesh create /cluster/firewall/rules --action ACCEPT --type in --enable 1 --dport 5405:5412 --proto udp --source 10.0.0.0/24 --comment &quot;Corosync&quot;

# Live migration
pvesh create /cluster/firewall/rules --action ACCEPT --type in --enable 1 --dport 60000:60050 --proto tcp --source 10.0.0.0/24 --comment &quot;Migration&quot;

# Proxmox API from cluster peers (pvedaemon on port 85 binds to localhost only)
pvesh create /cluster/firewall/rules --action ACCEPT --type in --enable 1 --dport 8006 --proto tcp --source 10.0.0.0/24 --comment &quot;API&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Automatic Updates&lt;/h2&gt;
&lt;p&gt;Security updates shouldn&apos;t wait for you to remember. Automate them.&lt;/p&gt;
&lt;h3&gt;Unattended Upgrades&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;apt install unattended-upgrades apt-listchanges

# Enable automatic updates
dpkg-reconfigure -plow unattended-upgrades
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Configure &lt;code&gt;/etc/apt/apt.conf.d/50unattended-upgrades&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Unattended-Upgrade::Allowed-Origins {
    &quot;${distro_id}:${distro_codename}&quot;;
    &quot;${distro_id}:${distro_codename}-security&quot;;
    &quot;${distro_id}:${distro_codename}-updates&quot;;
    &quot;Proxmox:${distro_codename}&quot;;
};

// Email notification
Unattended-Upgrade::Mail &quot;admin@example.com&quot;;
Unattended-Upgrade::MailReport &quot;on-change&quot;;

// Don&apos;t auto-reboot (hypervisor needs planned reboots)
Unattended-Upgrade::Automatic-Reboot &quot;false&quot;;

// Remove unused dependencies
Unattended-Upgrade::Remove-Unused-Dependencies &quot;true&quot;;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Manual Update Workflow&lt;/h3&gt;
&lt;p&gt;For major updates or kernel changes, manual process is safer:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check what will be updated
apt update
apt list --upgradable

# Apply updates
apt full-upgrade

# Check if reboot needed
cat /var/run/reboot-required 2&amp;gt;/dev/null &amp;amp;&amp;amp; echo &quot;Reboot required&quot;

# Check Proxmox version
pveversion -v
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Schedule maintenance windows. Don&apos;t reboot during business hours if you can avoid it.&lt;/p&gt;
&lt;h2&gt;Additional Hardening&lt;/h2&gt;
&lt;h3&gt;Disable Unused Services&lt;/h3&gt;
&lt;p&gt;Check what&apos;s listening:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ss -tlnp
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you&apos;re not using Spice or VNC consoles directly:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Don&apos;t disable if you use them!
# systemctl disable --now spiceproxy
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Kernel Parameters&lt;/h3&gt;
&lt;p&gt;Add to &lt;code&gt;/etc/sysctl.d/99-security.conf&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Disable IP forwarding (enable if needed for routing VMs)
# net.ipv4.ip_forward = 0

# Ignore ICMP redirects
net.ipv4.conf.all.accept_redirects = 0
net.ipv6.conf.all.accept_redirects = 0

# Don&apos;t send ICMP redirects
net.ipv4.conf.all.send_redirects = 0

# Enable SYN flood protection
net.ipv4.tcp_syncookies = 1

# Log martian packets
net.ipv4.conf.all.log_martians = 1

# Ignore broadcast pings
net.ipv4.icmp_echo_ignore_broadcasts = 1

# Restrict dmesg
kernel.dmesg_restrict = 1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Apply:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sysctl -p /etc/sysctl.d/99-security.conf
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Filesystem Hardening&lt;/h3&gt;
&lt;p&gt;Mount options for security:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check current mounts
mount | grep -E &apos;ext4|zfs&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For non-ZFS systems, consider noexec on /tmp. For ZFS, Proxmox handles mount options appropriately.&lt;/p&gt;
&lt;h3&gt;Audit Logging&lt;/h3&gt;
&lt;p&gt;Install and configure auditd for compliance requirements:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apt install auditd audispd-plugins

# Basic rules
cat &amp;gt;&amp;gt; /etc/audit/rules.d/audit.rules &amp;lt;&amp;lt; &apos;EOF&apos;
# Monitor sudo usage
-w /etc/sudoers -p wa -k sudoers
-w /etc/sudoers.d/ -p wa -k sudoers

# Monitor SSH config
-w /etc/ssh/sshd_config -p wa -k sshd

# Monitor user/group changes
-w /etc/passwd -p wa -k passwd
-w /etc/group -p wa -k group
EOF

systemctl restart auditd
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Backup Your Config&lt;/h2&gt;
&lt;p&gt;Before you forget what you configured:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Backup host config
tar -czf /root/pve-config-$(date +%Y%m%d).tar.gz /etc/pve /etc/ssh /etc/apt

# Copy off-host
scp /root/pve-config-*.tar.gz backup-server:/backups/
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Do this after any significant configuration change.&lt;/p&gt;
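&lt;p&gt;To keep this from depending on memory, the same command can run from a cron drop-in. A sketch assuming the paths above; note the backslash-escaped &lt;code&gt;%&lt;/code&gt;, which cron otherwise treats as a newline:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# /etc/cron.d/pve-config-backup: weekly host config backup, Sundays 03:00
0 3 * * 0 root tar -czf /root/pve-config-$(date +\%Y\%m\%d).tar.gz /etc/pve /etc/ssh /etc/apt
&lt;/code&gt;&lt;/pre&gt;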
&lt;h2&gt;Security Checklist&lt;/h2&gt;
&lt;p&gt;Run through this after every install:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[ ] Non-root admin user created
[ ] Admin user has 2FA enabled
[ ] SSH key-only authentication
[ ] Root SSH login disabled
[ ] Host firewall enabled
[ ] Management access restricted to trusted networks
[ ] Unattended security updates configured
[ ] Fail2ban installed (optional)
[ ] Initial config backed up
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;What I Don&apos;t Do&lt;/h2&gt;
&lt;p&gt;Some hardening guides go overboard. Here&apos;s what I skip and why:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Change SSH port&lt;/strong&gt;: Security through obscurity. Attackers scan all ports. Fail2ban is more effective.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Disable IPv6&lt;/strong&gt;: If you&apos;re not using it, fine. But disabling often breaks things in unexpected ways.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Complex MAC/SELinux policies&lt;/strong&gt;: On a hypervisor, the VMs are the workload. Host runs minimal services. Default policies are usually sufficient.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Intrusion detection (OSSEC, etc.)&lt;/strong&gt;: Good for compliance. For homelab, log monitoring and backups are more practical.&lt;/p&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Security is easier to do now than &quot;someday later.&quot;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A fresh install is a blank slate. Every change you make is documented in your head. Wait a month, and you&apos;ve forgotten what&apos;s default and what you configured. Wait a year, and the system is running workloads you&apos;re afraid to touch.&lt;/p&gt;
&lt;p&gt;The first hour after install is when hardening is easiest:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You haven&apos;t forgotten what&apos;s there&lt;/li&gt;
&lt;li&gt;No workloads depend on insecure defaults&lt;/li&gt;
&lt;li&gt;Changes don&apos;t require maintenance windows&lt;/li&gt;
&lt;li&gt;Documentation is fresh&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Do it now. These configurations survive upgrades. The 30 minutes you spend today prevent the 3 AM incident next year.&lt;/p&gt;
</content:encoded><category>proxmox</category><category>firewall</category><category>security</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Why I Chose Proxmox (and How to Install It the Boring, Correct Way)</title><link>https://ashimov.com/posts/proxmox-install/</link><guid isPermaLink="true">https://ashimov.com/posts/proxmox-install/</guid><description>Proxmox VE installation done right. Covers disk layout decisions, ZFS vs LVM vs ext4, network configuration, repository setup, and why the boring install is the one that survives upgrades.</description><pubDate>Fri, 01 Aug 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I&apos;ve run ESXi, Hyper-V, oVirt, and plain KVM with libvirt. They all work. But when Broadcom acquired VMware and started the licensing chaos, I moved everything to Proxmox. Not because it&apos;s trendy — because it&apos;s boring in the best way.&lt;/p&gt;
&lt;p&gt;Proxmox is Debian with a web UI and good defaults. When things break (and they will), you&apos;re debugging Linux, not a proprietary hypervisor. The skills transfer. The logs make sense. The community has seen your problem before.&lt;/p&gt;
&lt;p&gt;This is how to install Proxmox in a way that doesn&apos;t create pain later.&lt;/p&gt;
&lt;h2&gt;Why Proxmox&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;It&apos;s just Debian.&lt;/strong&gt; Under the web UI, it&apos;s apt, systemd, and standard Linux networking. When the UI doesn&apos;t do what you need, drop to the shell.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;ZFS first-class.&lt;/strong&gt; Built-in ZFS support with proper integration. Snapshots, replication, compression — all accessible from the UI.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No licensing games.&lt;/strong&gt; The &quot;enterprise&quot; repository requires a subscription, but the no-subscription repository works fine. You&apos;re not crippled without paying.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Clustering is free.&lt;/strong&gt; Three nodes, shared storage, HA — no extra licenses.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Both VMs and containers.&lt;/strong&gt; KVM for full VMs, LXC for lightweight containers. Same management interface.&lt;/p&gt;
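&lt;p&gt;Both live behind the same CLI family, too. A minimal sketch (the VM ID, container ID, and template filename are placeholders — adjust to what&apos;s actually on your node):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# KVM virtual machine
qm create 100 --name test-vm --memory 2048 --cores 2 \
  --net0 virtio,bridge=vmbr0

# LXC container (template must already be downloaded to local storage)
pct create 200 local:vztmpl/debian-12-standard_12.7-1_amd64.tar.zst \
  --hostname test-ct --memory 512 \
  --net0 name=eth0,bridge=vmbr0,ip=dhcp
&lt;/code&gt;&lt;/pre&gt;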
&lt;h2&gt;Before You Install&lt;/h2&gt;
&lt;h3&gt;Hardware Considerations&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;CPU:&lt;/strong&gt; Intel or AMD with virtualization extensions (VT-x/AMD-V). Check BIOS — these are sometimes disabled by default.&lt;/p&gt;
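&lt;p&gt;You can confirm the extensions are visible before installing — boot any Linux live USB and check (a count of 0 means they&apos;re absent or disabled in BIOS):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Count CPU threads advertising VT-x (vmx) or AMD-V (svm)
grep -Ec &apos;(vmx|svm)&apos; /proc/cpuinfo
&lt;/code&gt;&lt;/pre&gt;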
&lt;p&gt;&lt;strong&gt;RAM:&lt;/strong&gt; Minimum 8GB for the host, but realistically 32GB+ for anything useful. ECC is recommended for ZFS but not required.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Storage:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Boot drive: SSD, 32GB minimum (128GB comfortable)&lt;/li&gt;
&lt;li&gt;VM storage: Separate drive(s), SSD strongly preferred&lt;/li&gt;
&lt;li&gt;ZFS: Wants multiple drives for redundancy&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Network:&lt;/strong&gt; Dedicated NIC for management, additional for VM traffic if you&apos;re serious.&lt;/p&gt;
&lt;h3&gt;The Decision: ZFS vs LVM vs ext4&lt;/h3&gt;
&lt;p&gt;This is the first fork in the road. Choose wrong and you&apos;ll reinstall later.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;ZFS (my choice for most cases):&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Built-in checksums, catches silent corruption&lt;/li&gt;
&lt;li&gt;Snapshots are instant and cheap&lt;/li&gt;
&lt;li&gt;Compression saves space with minimal CPU overhead&lt;/li&gt;
&lt;li&gt;Replication to another node is trivial&lt;/li&gt;
&lt;li&gt;Requires more RAM (1GB per TB of storage, roughly)&lt;/li&gt;
&lt;li&gt;Single-disk ZFS works fine, just no redundancy&lt;/li&gt;
&lt;/ul&gt;
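&lt;p&gt;The snapshot and replication claims are easy to see at the command line. A sketch — dataset names are illustrative, and the receive side assumes a node &lt;code&gt;pve2&lt;/code&gt; with a matching pool:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Instant snapshot of the VM dataset
zfs snapshot rpool/data@before-change

# Roll back if the change went badly
zfs rollback rpool/data@before-change

# Replicate to another node over SSH
zfs send rpool/data@before-change | ssh pve2 zfs recv -F rpool/data
&lt;/code&gt;&lt;/pre&gt;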
&lt;p&gt;&lt;strong&gt;LVM-thin:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Less RAM overhead&lt;/li&gt;
&lt;li&gt;Snapshots work, but they&apos;re less elegant than ZFS snapshots&lt;/li&gt;
&lt;li&gt;No checksums&lt;/li&gt;
&lt;li&gt;Familiar if you know LVM&lt;/li&gt;
&lt;li&gt;Good choice for simple setups or low-RAM systems&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;ext4/XFS on raw disk:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Simplest&lt;/li&gt;
&lt;li&gt;No snapshots without external tools&lt;/li&gt;
&lt;li&gt;Fine for the boot drive if VMs live elsewhere&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;My recommendation:&lt;/strong&gt; ZFS unless you have less than 16GB RAM or specific reasons not to. The data integrity alone is worth it.&lt;/p&gt;
&lt;h2&gt;Installation&lt;/h2&gt;
&lt;p&gt;Download the ISO from proxmox.com. Write it to USB with &lt;code&gt;dd&lt;/code&gt;, Rufus, or Etcher.&lt;/p&gt;
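&lt;p&gt;If you use &lt;code&gt;dd&lt;/code&gt;, identify the target device first — writing to the wrong disk is unrecoverable. &lt;code&gt;/dev/sdX&lt;/code&gt; and the ISO filename below are placeholders:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Identify the USB stick by size and model
lsblk -o NAME,SIZE,MODEL

# Write the ISO (use the whole device, not a partition)
dd if=proxmox-ve.iso of=/dev/sdX bs=4M status=progress conv=fsync
&lt;/code&gt;&lt;/pre&gt;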
&lt;h3&gt;Boot and Initial Screens&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Boot from USB&lt;/li&gt;
&lt;li&gt;Select &quot;Install Proxmox VE&quot;&lt;/li&gt;
&lt;li&gt;Accept EULA&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Target Disk Selection&lt;/h3&gt;
&lt;p&gt;This is where most people make mistakes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Single disk:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Target Harddisk: /dev/sda
Filesystem: zfs (RAID0)   # Yes, RAID0 for single disk
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;RAID0 on one disk sounds wrong, but it&apos;s just &quot;use this one disk with ZFS.&quot;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Multiple disks for redundancy:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Filesystem: zfs (RAID1)   # Mirror, needs 2+ disks
# or
Filesystem: zfs (RAIDZ-1) # Needs 3+ disks
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Advanced options (click &quot;Options&quot; button):&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ashift: 12           # Correct for most SSDs (4K sectors)
compress: lz4        # Basically free compression
checksum: on         # Never turn this off
copies: 1            # 2 for paranoid, uses 2x space
hdsize: &amp;lt;leave blank or set limit&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you&apos;re using NVMe, &lt;code&gt;ashift=12&lt;/code&gt; is still correct for most drives.&lt;/p&gt;
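&lt;p&gt;After the install you can confirm what ashift the pool actually got (&lt;code&gt;zpool get ashift&lt;/code&gt; works on recent OpenZFS; older versions need &lt;code&gt;zdb&lt;/code&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;zpool get ashift rpool

# On older ZFS versions:
zdb -C rpool | grep ashift
&lt;/code&gt;&lt;/pre&gt;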
&lt;h3&gt;Network Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Management Interface: eno1 (or whatever your NIC is)
Hostname (FQDN): pve1.lab.local
IP Address: 192.168.1.10
Netmask: 255.255.255.0
Gateway: 192.168.1.1
DNS Server: 192.168.1.1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Use a static IP.&lt;/strong&gt; DHCP for a hypervisor is asking for trouble.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;FQDN matters.&lt;/strong&gt; Clustering uses hostnames. Get it right now or fix it painfully later.&lt;/p&gt;
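&lt;p&gt;Worth verifying right after first boot — Proxmox expects the hostname to resolve to the management IP, not to a loopback address (values below match the example config):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;hostname --fqdn     # should print pve1.lab.local
getent hosts pve1   # should resolve to 192.168.1.10
cat /etc/hosts      # the entry should map the real IP, not 127.0.1.1
&lt;/code&gt;&lt;/pre&gt;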
&lt;h3&gt;Timezone and Password&lt;/h3&gt;
&lt;p&gt;Set your timezone. Set a strong root password. You&apos;ll create non-root users later.&lt;/p&gt;
&lt;h3&gt;Installation Completes&lt;/h3&gt;
&lt;p&gt;Remove USB, reboot. Access web UI at &lt;code&gt;https://&amp;lt;ip&amp;gt;:8006&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;Post-Install: Repository Configuration&lt;/h2&gt;
&lt;p&gt;The default install points to the enterprise repository, which requires a subscription. Without one, you&apos;ll see errors during updates. Fix this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Disable enterprise repository
mv /etc/apt/sources.list.d/pve-enterprise.list /etc/apt/sources.list.d/pve-enterprise.list.disabled

# Add no-subscription repository
echo &quot;deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription&quot; &amp;gt; /etc/apt/sources.list.d/pve-no-subscription.list

# Update
apt update &amp;amp;&amp;amp; apt full-upgrade -y
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For Ceph (if you&apos;re using it):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Same pattern
mv /etc/apt/sources.list.d/ceph.list /etc/apt/sources.list.d/ceph.list.disabled
echo &quot;deb http://download.proxmox.com/debian/ceph-quincy bookworm no-subscription&quot; &amp;gt; /etc/apt/sources.list.d/ceph-no-subscription.list
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Verify Installation&lt;/h2&gt;
&lt;h3&gt;Check ZFS&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;zpool status
# Should show your rpool, healthy

zfs list
# Should show rpool and rpool/data
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Check VM Storage&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;pvesm status
# Should show local, local-lvm (or local-zfs)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Check Networking&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;ip a
# Should show vmbr0 bridge with your management IP
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Check Services&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;systemctl status pvedaemon
systemctl status pveproxy
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Initial Configuration via Web UI&lt;/h2&gt;
&lt;p&gt;Navigate to &lt;code&gt;https://&amp;lt;ip&amp;gt;:8006&lt;/code&gt;, login as root.&lt;/p&gt;
&lt;h3&gt;Datacenter → Storage&lt;/h3&gt;
&lt;p&gt;You&apos;ll see:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;local&lt;/code&gt;: ISO images, container templates, backups (directory storage)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;local-lvm&lt;/code&gt; or &lt;code&gt;local-zfs&lt;/code&gt;: VM disks (block storage)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is fine to start. We&apos;ll discuss storage architecture later.&lt;/p&gt;
&lt;h3&gt;Node → System → DNS&lt;/h3&gt;
&lt;p&gt;Verify DNS is correct. Add a search domain if needed.&lt;/p&gt;
&lt;h3&gt;Node → System → Time&lt;/h3&gt;
&lt;p&gt;Verify timezone. NTP is configured by default (systemd-timesyncd).&lt;/p&gt;
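&lt;p&gt;A quick way to confirm time sync is actually working:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;timedatectl status
# Look for: &quot;System clock synchronized: yes&quot; and &quot;NTP service: active&quot;
&lt;/code&gt;&lt;/pre&gt;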
&lt;h3&gt;Subscription Nag&lt;/h3&gt;
&lt;p&gt;You&apos;ll see a subscription popup on login. This is expected without a subscription. It&apos;s just a nag, not a limitation.&lt;/p&gt;
&lt;p&gt;To remove it (optional, slightly hacky):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# This modifies the JS file - breaks on updates, needs reapply
sed -Ezi.bak &quot;s/(Ext\.Msg\.show\(\{[^}]+title: gettext\(&apos;No valid sub)/void\(\{ \/\/ \1/g&quot; /usr/share/javascript/proxmox-widget-toolkit/proxmoxlib.js
systemctl restart pveproxy
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I don&apos;t bother. It&apos;s one click to dismiss.&lt;/p&gt;
&lt;h2&gt;The Disk Layout I Actually Use&lt;/h2&gt;
&lt;p&gt;For a single-node homelab:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;/dev/nvme0n1 (500GB NVMe)
  └── ZFS: rpool
      ├── rpool/ROOT/pve-1    # Proxmox OS
      └── rpool/data          # VM disks

/dev/sda (2TB SATA SSD) - optional
  └── ZFS: tank
      └── tank/backups        # Backup storage
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For a production cluster:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;/dev/nvme0n1 (256GB NVMe)
  └── ZFS: rpool              # OS only, small and fast

/dev/sda, /dev/sdb (2x 1TB SSD)
  └── ZFS mirror: vmpool
      └── vmpool/data         # VM disks with redundancy

/dev/sdc, /dev/sdd (2x 4TB HDD)
  └── ZFS mirror: backup
      └── backup/pbs          # Proxmox Backup Server storage
&lt;/code&gt;&lt;/pre&gt;
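&lt;p&gt;The installer only creates &lt;code&gt;rpool&lt;/code&gt; — you build the extra pools afterwards. A sketch for the mirrored VM pool above (device names and the storage ID are examples; prefer &lt;code&gt;/dev/disk/by-id/&lt;/code&gt; paths in real use):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Mirrored SSD pool for VM disks
zpool create -o ashift=12 vmpool mirror /dev/sda /dev/sdb
zfs set compression=lz4 vmpool

# Register it with Proxmox as VM disk storage
pvesm add zfspool vmpool-storage --pool vmpool --content images,rootdir
&lt;/code&gt;&lt;/pre&gt;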
&lt;p&gt;Key principle: &lt;strong&gt;Separate OS from VM storage.&lt;/strong&gt; If your VM pool fills up, your host still boots.&lt;/p&gt;
&lt;h2&gt;Updates: The Boring Part That Matters&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;# Regular updates
apt update &amp;amp;&amp;amp; apt full-upgrade

# Check for kernel updates
pveversion -v

# Reboot if kernel updated
reboot
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Before major version upgrades (7.x → 8.x):&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Read the official upgrade guide completely&lt;/li&gt;
&lt;li&gt;Backup everything&lt;/li&gt;
&lt;li&gt;Test on non-production first&lt;/li&gt;
&lt;li&gt;Run the upgrade checklist script&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;pve7to8 --full  # For 7→8 upgrade, shows potential issues
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;What I Wish I Knew&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;ZFS RAM usage.&lt;/strong&gt; ZFS wants RAM for ARC (adaptive replacement cache). Default is up to 50% of RAM. For a VM host, you might want to limit it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Limit ARC to 4GB
echo &quot;options zfs zfs_arc_max=4294967296&quot; &amp;gt; /etc/modprobe.d/zfs.conf
update-initramfs -u
&lt;/code&gt;&lt;/pre&gt;
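&lt;p&gt;The modprobe option only applies after the initramfs rebuild and a reboot. You can inspect (and change) the limit at runtime via sysfs — changes made this way don&apos;t persist:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Current limit in bytes (0 means the built-in default)
cat /sys/module/zfs/parameters/zfs_arc_max

# Apply 4GB immediately, without rebooting
echo 4294967296 &amp;gt; /sys/module/zfs/parameters/zfs_arc_max
&lt;/code&gt;&lt;/pre&gt;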
&lt;p&gt;&lt;strong&gt;Enterprise vs no-subscription repo.&lt;/strong&gt; They&apos;re nearly identical. Enterprise gets updates slightly earlier, that&apos;s all. No-subscription is fine for production.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Clustering from the start.&lt;/strong&gt; If you might cluster later, plan for it now. Same network segment, unique hostnames, Corosync-compatible setup.&lt;/p&gt;
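&lt;p&gt;When you do cluster, the commands themselves are short — the planning above is the hard part. A sketch (cluster name and IP are examples):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On the first node
pvecm create mycluster

# On each additional node, join via the first node&apos;s IP
pvecm add 192.168.1.10

# Verify membership and quorum
pvecm status
&lt;/code&gt;&lt;/pre&gt;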
&lt;p&gt;&lt;strong&gt;Backups are separate.&lt;/strong&gt; Proxmox creates VMs. Proxmox Backup Server (PBS) backs them up. They&apos;re different products that work together. Plan your backup storage accordingly.&lt;/p&gt;
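&lt;p&gt;Even without PBS, &lt;code&gt;vzdump&lt;/code&gt; ships with Proxmox and covers basic backups. A minimal example (the VM ID and storage name are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Snapshot-mode backup of VM 100 to the &apos;local&apos; storage, zstd-compressed
vzdump 100 --mode snapshot --storage local --compress zstd
&lt;/code&gt;&lt;/pre&gt;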
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;The most important thing isn&apos;t &apos;install&apos; — it&apos;s laying the foundation for upgrades.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A Proxmox install takes 10 minutes. The choices you make during those 10 minutes affect the next 5 years. Wrong disk layout? Reinstall. Wrong hostname? Pain when clustering. Wrong storage? Juggling VMs later.&lt;/p&gt;
&lt;p&gt;The boring install is the one that survives:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;ZFS for data integrity&lt;/li&gt;
&lt;li&gt;Static IP, proper FQDN&lt;/li&gt;
&lt;li&gt;Repository configured correctly&lt;/li&gt;
&lt;li&gt;Separate OS and VM storage&lt;/li&gt;
&lt;li&gt;Documentation of what you did&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Proxmox will upgrade through multiple major versions if you don&apos;t make weird choices at install time. That&apos;s the goal: a hypervisor you forget is there because it just works.&lt;/p&gt;
</content:encoded><category>proxmox</category><category>homelab</category><category>virtualization</category><author>berik@ashimov.com (Berik Ashimov)</author></item></channel></rss>