High Availability: VRRP + State Sync (What You Can and Can't Do)

High availability sounds simple: two routers, one fails, the other takes over. Users don’t notice. Uptime maintained. Check the HA box and move on.

Reality is messier. VRRP can fail over an IP address in seconds. But what about active connections? NAT state? BGP sessions? Firewall sessions? Some of this can be synchronized. Some can’t. Some can, but with caveats that matter.

This is an honest guide to VyOS HA. What works, what doesn’t, and how to test so you find out before production does.

VRRP Basics

VRRP (Virtual Router Redundancy Protocol) provides a virtual IP (VIP) shared between two or more routers. One is master, others are backup. If the master fails, a backup takes the VIP.

Clients point to the VIP. They don’t care which physical router is answering.

Basic VRRP Configuration

Two VyOS routers: R1 (primary) and R2 (backup).

R1 (Primary):

configure
set interfaces ethernet eth0 address '10.0.0.2/24'
set interfaces ethernet eth0 description 'LAN'
set high-availability vrrp group LAN vrid '10'
set high-availability vrrp group LAN interface 'eth0'
set high-availability vrrp group LAN virtual-address '10.0.0.1/24'
set high-availability vrrp group LAN priority '200'
# Preemption is on by default in VyOS; 'no-preempt' turns it off
commit

R2 (Backup):

configure
set interfaces ethernet eth0 address '10.0.0.3/24'
set interfaces ethernet eth0 description 'LAN'
set high-availability vrrp group LAN vrid '10'
set high-availability vrrp group LAN interface 'eth0'
set high-availability vrrp group LAN virtual-address '10.0.0.1/24'
set high-availability vrrp group LAN priority '100'
# Preemption is on by default in VyOS; 'no-preempt' turns it off
commit

Key settings:

  • vrid: Virtual Router ID. Must match on both routers.
  • virtual-address: The shared IP clients use (10.0.0.1).
  • priority: Higher wins. R1 at 200 beats R2 at 100.
  • preempt: Enabled by default in VyOS, so a recovered R1 reclaims master status. Set no-preempt on the group to disable it.

Verify:

show vrrp
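
For scripted checks, the state column can be pulled out of that output. The sketch below assumes the tabular layout VyOS prints for this command, with the group name in the first column and the state in the fourth. `vrrp_state` is a hypothetical helper name, and the column positions should be checked against your release.

```shell
# Hypothetical helper: print the state (MASTER/BACKUP/FAULT) of one VRRP group.
# Assumes 'show vrrp' prints one row per group with the group name in
# column 1 and the state in column 4; adjust the column if your output differs.
vrrp_state() {
    awk -v group="$1" '$1 == group { print $4 }'
}
```

Feed it the text of `show vrrp` on standard input, for example from a saved capture: `vrrp_state LAN < vrrp-output.txt` (the filename is a placeholder).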

Multiple VRRP Groups

Most routers have multiple interfaces. Each needs its own VRRP group:

configure
# LAN side
set high-availability vrrp group LAN vrid '10'
set high-availability vrrp group LAN interface 'eth0'
set high-availability vrrp group LAN virtual-address '10.0.0.1/24'
set high-availability vrrp group LAN priority '200'
# WAN side
set high-availability vrrp group WAN vrid '20'
set high-availability vrrp group WAN interface 'eth1'
set high-availability vrrp group WAN virtual-address '203.0.113.1/24'
set high-availability vrrp group WAN priority '200'
commit

Sync Groups: Fail Together

If the LAN interface fails but the WAN is fine, you want BOTH groups to fail over. Otherwise, traffic enters one router and tries to exit the other: an asymmetric routing disaster.

configure
set high-availability vrrp sync-group MAIN member 'LAN'
set high-availability vrrp sync-group MAIN member 'WAN'
commit

Now if either interface fails, both VRRP groups fail over together.

What VRRP Does NOT Do

VRRP fails over IP addresses. That’s it. It does NOT automatically:

  • Transfer active TCP connections
  • Sync NAT translation tables
  • Maintain firewall connection state
  • Preserve BGP sessions
  • Sync DHCP leases

For that, you need additional sync mechanisms.

Connection Tracking Sync (Conntrack)

VyOS can synchronize its connection tracking table between routers. This lets established connections (TCP sessions, NAT translations) survive failover, with the caveats covered below.

Conntrack Sync Configuration

On both routers:

configure
# Define sync interface (dedicated link between routers)
set service conntrack-sync interface 'eth2'
set service conntrack-sync failover-mechanism vrrp sync-group 'MAIN'
set service conntrack-sync accept-protocol 'tcp,udp,icmp'
# Optional: Exclude local traffic
set service conntrack-sync ignore-address ipv4-address '10.0.0.2'
set service conntrack-sync ignore-address ipv4-address '10.0.0.3'
commit

Requirements:

  • Dedicated interface between routers (eth2 in this example)
  • This interface should be a direct cable between the routers or an isolated VLAN
  • It should NOT run through the same switch that might fail
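
A quick way to confirm that state is actually being exchanged is from operational mode. These two commands should be available on current VyOS releases, though their output format varies by version, so treat this as a starting point rather than a complete check:

```shell
# Run on each router from operational mode (not inside 'configure')
show conntrack-sync status
show conntrack-sync statistics
```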

What Conntrack Sync Does

  • Syncs NAT translation table (internal→external mappings)
  • Syncs connection states (ESTABLISHED, RELATED)
  • Allows TCP connections to survive failover (mostly)

What Conntrack Sync Does NOT Do

  • Guarantee zero packet loss during failover
  • Sync application-layer state
  • Help with stateless protocols beyond basic tracking

The “Mostly” Caveat

TCP connections can survive, but:

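  1. Packets in flight are lost. During failover (a few seconds: at the default 1-second advertisement interval, a backup waits just over 3 seconds after the master goes silent), packets are dropped. TCP will retransmit, but there’s a gap.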

  2. Sequence number issues. Sometimes the new master’s kernel disagrees about TCP sequence numbers. Connection may reset.

  3. Asymmetric routing. If return traffic goes to wrong router, connections break. Sync groups help, but network design matters.

Realistic expectation: Long-lived connections (SSH sessions, database connections) usually survive. Short requests during failover may fail and retry. Users experience a brief hiccup, not a disconnect.

What You CAN’T Sync

BGP Sessions

A BGP session runs between your router and its peer. When you fail over:

  • New master has different source IP (its physical IP)
  • Peer sees different neighbor
  • BGP session must re-establish

This takes seconds to minutes depending on timers.

Mitigation: Use aggressive BGP timers, BFD, and accept that BGP convergence is part of failover time.

configure
# Aggressive keepalive (3s) and hold (9s)
set protocols bgp neighbor 198.51.100.1 timers keepalive '3'
set protocols bgp neighbor 198.51.100.1 timers holdtime '9'
# BFD for faster detection
set protocols bgp neighbor 198.51.100.1 bfd
commit
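
The bfd flag on the neighbor only helps if the BFD session itself detects failure quickly. As a sketch, assuming the peer at 198.51.100.1 also runs BFD and tolerates these values, the session timers can be set explicitly (intervals are in milliseconds):

```shell
configure
# Detection time is roughly multiplier x interval: 3 x 300 ms = 900 ms
set protocols bfd peer 198.51.100.1 interval transmit '300'
set protocols bfd peer 198.51.100.1 interval receive '300'
set protocols bfd peer 198.51.100.1 interval multiplier '3'
commit
```

These values are illustrative, not a recommendation. Both ends negotiate the timers, so the slower side's settings decide the real detection time.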

IPsec Tunnels

IPsec SAs are bound to specific IPs. On failover:

  • IKE SAs re-negotiate
  • Child SAs re-establish
  • Tunnel is down for seconds

Mitigation: Use DPD (Dead Peer Detection) with short intervals. Accept brief tunnel downtime.

Routing Protocol State

OSPF neighbor relationships, BGP tables — none of this syncs. The new master starts fresh:

  • OSPF: Neighbors detect failure via dead interval, then re-elect
  • BGP: Sessions reset, routes re-exchanged

Application Sessions

If you’re running services on VyOS (rare, but possible):

  • DHCP leases: Can be synced with the DHCP server’s own failover feature (ISC DHCP failover), which is configured separately from VRRP
  • DNS cache: Lost
  • Any local state: Lost

Testing Failover

HA that isn’t tested is HA that doesn’t work. Test before production.

Test 1: Clean Failover

# From a client: ping the VIP continuously (start this first)
ping 10.0.0.1
# On primary: simulate failure
sudo ip link set eth0 down
# On secondary: confirm it is now MASTER
show vrrp
# On primary: restore
sudo ip link set eth0 up
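
To put a number on the outage rather than eyeballing it, the client-side ping can be timestamped and the longest gap between replies measured. This is a sketch: it assumes the iputils ping found on most Linux clients (its -D flag prefixes each line with a Unix timestamp), and `longest_gap` is a hypothetical helper name.

```shell
# Hypothetical helper: read 'ping -D' output on stdin and print the longest
# gap, in seconds, between two consecutive replies.
longest_gap() {
    awk '/bytes from/ {
             ts = $1
             gsub(/\[|\]/, "", ts)          # "[1700000000.12]" -> "1700000000.12"
             if (seen && ts - prev > max) max = ts - prev
             prev = ts
             seen = 1
         }
         END { printf "%.1f\n", max + 0 }'
}
```

During the test, run `ping -D -i 0.2 10.0.0.1 | tee ping.log` on the client (ping.log is a placeholder name), then `longest_gap < ping.log` afterwards. With a 0.2-second ping interval the result is only accurate to about 0.2 seconds.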

Test 2: Primary Recovery (Preemption)

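# Preemption must be active (the VyOS default; 'no-preempt' disables it)
# On primary: take it down and let the secondary take over
sudo ip link set eth0 down
# On primary: bring it back up
sudo ip link set eth0 up
# On primary: after any preempt-delay has passed, verify it reclaims master
show vrrp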

Test 3: Connection Survival

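# From a client: start a long-running connection through the router
# (the client's gateway is the VIP; the server is on the other side)
ssh user@server-behind-router
# On primary: trigger failover
sudo ip link set eth0 down
# Back in the SSH session: type a command and check that it survives
# It should pause briefly, then continue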

Test 4: Split Brain

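What if both routers are up but stop hearing each other?

# Case A: disconnect eth2 (sync interface)
# VRRP advertisements still flow on eth0/eth1, so there is still one master
# Conntrack sync stops, so a later failover drops established connections
# Case B: block VRRP advertisements between the routers on eth0 (e.g. a partitioned switch)
# The backup stops hearing the master, so both become master = split brain
# These are failure modes to document and monitor, not ones config alone prevents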
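
On the wire, split brain looks like VRRP advertisements for one VRID arriving from more than one source, because in steady state only the master advertises. A minimal sketch for spotting that, assuming default multicast VRRP and the usual one-line-per-packet output of tcpdump -n (`advert_sources` is a hypothetical helper name):

```shell
# Hypothetical helper: read 'tcpdump -n' output on stdin and print how many
# distinct source addresses sent VRRP advertisements for the given VRID.
# Assumes lines like:
#   12:00:01.000000 IP 10.0.0.2 > 224.0.0.18: VRRPv2, Advertisement, vrid 10, prio 200, ...
advert_sources() {
    awk -v vrid="$1" '
        /Advertisement/ && $0 ~ ("vrid " vrid ",") { src[$3] = 1 }
        END { n = 0; for (s in src) n++; print n }'
}
```

For example, `sudo tcpdump -n -c 10 -i eth0 vrrp | advert_sources 10` captures ten packets on the LAN side and counts the senders for VRID 10. A healthy pair should show one sender. More than one, outside the moment of a failover, means two routers both believe they are master.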

Test 5: Dual Failure

What if both routers fail?

This isn’t HA’s job. HA handles single failures. Document that dual failure = outage and set your expectations accordingly.

VRRP Tuning

The advertisement interval defaults to 1 second. Shorter intervals mean faster detection and faster failover, but more advertisement traffic.

configure
set high-availability vrrp group LAN advertise-interval '1'
commit

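For sub-second failover, some platforms run VRRP at 0.1-0.5 seconds. The VyOS advertise-interval option is documented in whole seconds (1-255), so 1 is the floor there. Be careful with sub-second timers in any case: they are more sensitive to network jitter.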
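
How long a backup waits before declaring the master dead follows from the advertisement interval and the backup’s own priority. The VRRP RFCs define it as three advertisement intervals plus a skew of (256 - priority) / 256 of an interval. A small sketch of that arithmetic (`master_down_ms` is a hypothetical helper name, and the integer division rounds down to the millisecond):

```shell
# Hypothetical helper: time in milliseconds a backup waits without hearing
# an advertisement before it takes over.
#   $1 = advertisement interval in milliseconds
#   $2 = this backup's priority (1-254)
master_down_ms() {
    adv_ms=$1
    prio=$2
    skew_ms=$(( (256 - prio) * adv_ms / 256 ))
    echo $(( 3 * adv_ms + skew_ms ))
}

# R2 from the basic configuration: 1 s interval, priority 100
master_down_ms 1000 100   # prints 3609
```

So with default timers R2 takes over roughly 3.6 seconds after R1 goes silent. A master that shuts down cleanly announces priority 0 first, which lets the backup skip the three-interval wait.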

Preempt Delay

When primary recovers, don’t immediately preempt. Let it stabilize:

configure
set high-availability vrrp group LAN preempt-delay '30'
commit

Primary must be up for 30 seconds before reclaiming master. Prevents flapping.

Health Check Scripts

Fail over based on more than interface status:

configure
set high-availability vrrp group LAN health-check script '/config/scripts/check-uplink.sh'
set high-availability vrrp group LAN health-check interval '5'
set high-availability vrrp group LAN health-check failure-count '3'
commit

Example script (/config/scripts/check-uplink.sh):

#!/bin/bash
# Check if upstream is reachable
ping -c 1 -W 1 198.51.100.1 > /dev/null 2>&1
exit $?

If the script returns non-zero 3 times in a row, VRRP fails over.
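
One weakness of a single-target check is that the target itself can go down, which would demote a perfectly healthy router. A common variation is to fail only when every target is unreachable. It is sketched here under the assumption that you have more than one upstream address worth probing (192.0.2.1 is a placeholder, and `probe` and `any_reachable` are hypothetical names):

```shell
#!/bin/bash
# Succeeds if one ping to the given address gets a reply within 1 second.
probe() {
    ping -c 1 -W 1 "$1" > /dev/null 2>&1
}

# Succeeds as soon as any listed target answers; fails if none do.
any_reachable() {
    for target in "$@"; do
        probe "$target" && return 0
    done
    return 1
}

# As the last line of the health-check script (addresses are placeholders):
#   any_reachable 198.51.100.1 192.0.2.1
```

With that final line uncommented, the script’s exit status is the function’s return value, which is what the health check evaluates.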

Realistic HA Architecture

A production VyOS HA setup:

            ┌─────────────────┐
            │    Internet     │
            └────────┬────────┘
      ┌──────────────┴──────────────┐
      │                             │
┌─────┴─────┐               ┌───────┴───────┐
│ R1 (Pri)  │───sync link───│  R2 (Backup)  │
│ eth1: WAN │     eth2      │   eth1: WAN   │
│ eth0: LAN │               │   eth0: LAN   │
└─────┬─────┘               └───────┬───────┘
      │        VIP: 10.0.0.1        │
      └──────────────┬──────────────┘
            ┌────────┴────────┐
            │   LAN Switch    │
            └────────┬────────┘
            ┌────────┴────────┐
            │     Clients     │
            └─────────────────┘

Key points:

  • Dedicated sync link (eth2) — not through the LAN switch
  • Both routers connect to the same LAN switch (a single point of failure, but usually acceptable)
  • VIP is what clients use
  • If the LAN switch fails, both routers are useless anyway

The Lesson

HA is not a checkbox. It’s a set of failure scenarios and tests.

VRRP gives you IP failover in seconds. Conntrack sync gives you connection state (mostly). But:

  • BGP sessions reset
  • IPsec tunnels re-establish
  • Application state is lost
  • Brief packet loss happens

HA means a single router failure doesn’t cause an outage. It doesn’t mean zero impact. Users may see a brief hiccup. Long connections survive but might stutter. This is acceptable for most use cases.

What makes HA work isn’t the configuration — it’s the testing. Every failure scenario you test is one you understand. Every one you skip is one that surprises you at 3 AM.

Document your failure modes. Test your failover. Know exactly what happens when the primary dies. That’s HA.