High Availability: VRRP + State Sync (What You Can and Can't Do)

November 14, 2025 · 8 min read

High availability sounds simple: two routers, one fails, the other takes over. Users don’t notice. Uptime maintained. Check the HA box and move on.

Reality is messier. VRRP can fail over an IP address in seconds. But what about active connections? NAT state? BGP sessions? Firewall sessions? Some of this can be synchronized. Some can’t. Some can, but with caveats that matter.

This is an honest guide to VyOS HA. What works, what doesn’t, and how to test so you find out before production does.

VRRP Basics

VRRP (Virtual Router Redundancy Protocol) provides a virtual IP (VIP) shared between two or more routers. One is master, others are backup. If the master fails, a backup takes the VIP.

Clients point to the VIP. They don’t care which physical router is answering.

Basic VRRP Configuration

Two VyOS routers: R1 (primary) and R2 (backup).

R1 (Primary):

configure

set interfaces ethernet eth0 address '10.0.0.2/24'
set interfaces ethernet eth0 description 'LAN'

set high-availability vrrp group LAN vrid '10'
set high-availability vrrp group LAN interface 'eth0'
set high-availability vrrp group LAN virtual-address '10.0.0.1/24'
set high-availability vrrp group LAN priority '200'
# Preemption is on by default; VyOS has 'no-preempt' to disable it, not a 'preempt' option

commit

R2 (Backup):

configure

set interfaces ethernet eth0 address '10.0.0.3/24'
set interfaces ethernet eth0 description 'LAN'

set high-availability vrrp group LAN vrid '10'
set high-availability vrrp group LAN interface 'eth0'
set high-availability vrrp group LAN virtual-address '10.0.0.1/24'
set high-availability vrrp group LAN priority '100'
# Preemption is on by default; nothing to set here

commit

Key settings:

vrid: Virtual Router ID. Must match on both routers.
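virtual-address: The shared IP clients use (10.0.0.1). VyOS 1.4 and later name this option address.
priority: Higher wins. R1 at 200 beats R2 at 100.
preempt: On by default in VyOS, so if R1 recovers, it reclaims master status. The no-preempt option turns this off.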

Verify:

show vrrp

Multiple VRRP Groups

Most routers have multiple interfaces. Each needs its own VRRP group:

configure

# LAN side
set high-availability vrrp group LAN vrid '10'
set high-availability vrrp group LAN interface 'eth0'
set high-availability vrrp group LAN virtual-address '10.0.0.1/24'
set high-availability vrrp group LAN priority '200'

# WAN side
set high-availability vrrp group WAN vrid '20'
set high-availability vrrp group WAN interface 'eth1'
set high-availability vrrp group WAN virtual-address '203.0.113.1/24'
set high-availability vrrp group WAN priority '200'

commit

Sync Groups: Fail Together

If the LAN interface fails but WAN is fine, you want BOTH groups to fail over. Otherwise, traffic enters through one router and tries to exit through the other. That asymmetric path breaks stateful NAT and firewalling.

configure

set high-availability vrrp sync-group MAIN member 'LAN'
set high-availability vrrp sync-group MAIN member 'WAN'

commit

Now if either interface fails, both VRRP groups fail over together.

What VRRP Does NOT Do

VRRP fails over IP addresses. That’s it. It does NOT automatically:

Transfer active TCP connections
Sync NAT translation tables
Maintain firewall connection state
Preserve BGP sessions
Sync DHCP leases

For that, you need additional sync mechanisms.

Connection Tracking Sync (Conntrack)

VyOS can synchronize its connection tracking table between routers. This lets established connections (TCP sessions, NAT translations) survive failover, with the caveats covered below.

Conntrack Sync Configuration

On both routers:

configure

# Define sync interface (dedicated link between routers)
set service conntrack-sync interface 'eth2'
set service conntrack-sync failover-mechanism vrrp sync-group 'MAIN'
set service conntrack-sync accept-protocol 'tcp,udp,icmp'

# Optional: Exclude local traffic
set service conntrack-sync ignore-address ipv4-address '10.0.0.2'
set service conntrack-sync ignore-address ipv4-address '10.0.0.3'

commit

Requirements:

Dedicated interface between routers (eth2 in this example)
This interface should be a direct cable between the routers or an isolated VLAN
NOT routed through the same switch whose failure you are trying to survive

What Conntrack Sync Does

Syncs NAT translation table (internal→external mappings)
Syncs connection states (ESTABLISHED, RELATED)
Allows TCP connections to survive failover (mostly)

What Conntrack Sync Does NOT Do

Guarantee zero packet loss during failover
Sync application-layer state
Help with UDP or ICMP beyond the basic flow entries conntrack keeps for them

The “Mostly” Caveat

TCP connections can survive, but:

Packets in flight are lost. During failover (a few seconds with default VRRP timers), packets are dropped. TCP will retransmit, but there’s a gap.
Sequence number issues. Synced state can lag the live connection, so the new master’s connection tracking may see sequence numbers outside the window it expects and treat those packets as invalid. The connection may stall or reset.
Asymmetric routing. If return traffic goes to the wrong router, connections break. Sync groups help, but network design matters.

Realistic expectation: Long-lived connections (SSH sessions, database connections) usually survive. Short requests during failover may fail and retry. Users experience a brief hiccup, not a disconnect.

What You CAN’T Sync

BGP Sessions

BGP sessions are between your router and peer. When you fail over:

New master has different source IP (its physical IP)
Peer sees different neighbor
BGP session must re-establish

This takes seconds to minutes depending on timers.

Mitigation: Use aggressive BGP timers, BFD, and accept that BGP convergence is part of failover time.

configure

# Aggressive keepalive (3s) and hold (9s)
set protocols bgp neighbor 198.51.100.1 timers keepalive '3'
set protocols bgp neighbor 198.51.100.1 timers holdtime '9'

# BFD for faster detection
set protocols bgp neighbor 198.51.100.1 bfd

commit
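
To put rough numbers on that trade-off, here is a small sketch of the detection arithmetic. The 300 ms interval and multiplier of 3 are assumed BFD values, not something set in the configuration above, so check what your VyOS release and your peer actually negotiate.

```shell
# Rough detection-time arithmetic for the timers above.
# ASSUMPTION: BFD runs at 300 ms with a detect multiplier of 3.

# bfd_detect_ms <interval_ms> <multiplier>
# BFD declares the peer down after <multiplier> missed intervals.
bfd_detect_ms() {
    echo $(( $1 * $2 ))
}

hold_s=9   # BGP holdtime from the config above

echo "BGP hold timer alone: up to $(( hold_s * 1000 )) ms"
echo "BFD (assumed 300 ms x 3): about $(bfd_detect_ms 300 3) ms"
```

Detection is only the first step. After the peer notices the failure, routes still have to be withdrawn and re-learned, so treat these figures as a floor rather than the full convergence time.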

IPsec Tunnels

IPsec SAs are bound to specific IPs. On failover:

IKE SAs re-negotiate
Child SAs re-establish
Tunnel is down for seconds

Mitigation: Use DPD (Dead Peer Detection) with short intervals. Accept brief tunnel downtime.

Routing Protocol State

OSPF neighbor relationships, BGP tables — none of this syncs. The new master starts fresh:

OSPF: Neighbors detect failure via dead interval, then re-elect
BGP: Sessions reset, routes re-exchanged

Application Sessions

If you’re running services on VyOS (rare, but possible):

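DHCP leases: Not covered by VRRP or conntrack sync. The DHCP server’s own failover feature (ISC DHCP failover on older releases) can keep leases in sync, but it is configured separately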
DNS cache: Lost
Any local state: Lost

Testing Failover

HA that isn’t tested is HA that doesn’t work. Test before production.

Test 1: Clean Failover

# On primary, simulate failure
sudo ip link set eth0 down

# Watch secondary take over
show vrrp

# Verify traffic flows
# From a client, ping the VIP continuously
ping 10.0.0.1

# Restore
sudo ip link set eth0 up
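
Watching a ping scroll by tells you failover happened, but not how long it took. The sketch below turns the ping summary line into a lost-packet count and an approximate outage. The helper names (lost_packets, outage_ms) and the sample summary line are made up for illustration, and the parsing assumes the usual "N packets transmitted, M received" format.

```shell
# lost_packets: read ping output on stdin, print transmitted minus received.
# Relies on the summary line "N packets transmitted, M received, ...".
lost_packets() {
    awk '/packets transmitted/ { print $1 - $4 }'
}

# outage_ms <lost_count> <ping_interval_ms>: approximate outage length
outage_ms() {
    echo $(( $1 * $2 ))
}

# Made-up sample, shaped like the summary of: ping -i 0.2 -c 300 10.0.0.1
sample='300 packets transmitted, 282 received, 6% packet loss, time 60123ms'

lost=$(printf '%s\n' "$sample" | lost_packets)
echo "lost=$lost packets, outage is roughly $(outage_ms "$lost" 200) ms"
```

In a real test, pipe the client’s ping straight into lost_packets and trigger the failover while it runs. A shorter ping interval gives finer resolution.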

Test 2: Primary Recovery (Preemption)

# Ensure preemption is active (the default; no-preempt must not be set)
# Take down primary, let secondary take over
# Bring primary back up
# Verify primary reclaims master
show vrrp

Test 3: Connection Survival

# Start long-running connection through router
# SSH through the VIP to a server on the other side
ssh user@server-behind-router

# Fail over primary
# Check if SSH session survives
# It should pause briefly then continue
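
"Pause briefly" is easier to judge with a number attached. One way to get it, sketched here under assumptions: stream one line per second over SSH, timestamp each line as it arrives on the client, then look for the largest gap. The max_gap helper and the ticks.log filename are hypothetical, and the timestamps in the demo are invented.

```shell
# Capture on the client while you trigger the failover (not run here):
#   ssh user@server-behind-router 'while true; do echo tick; sleep 1; done' \
#     | while read -r line; do date +%s; done > ticks.log

# max_gap: read one epoch timestamp per line on stdin,
# print the largest gap in seconds between consecutive lines.
max_gap() {
    awk 'NR > 1 && ($1 - prev) > max { max = $1 - prev }
         { prev = $1 }
         END { print max + 0 }'
}

# Invented arrival times with a 4-second stall in the middle
printf '%s\n' 1000 1001 1002 1006 1007 | max_gap   # prints 4
```

After a real run, max_gap < ticks.log gives the longest stall the session saw. If the session reset instead of stalling, the log simply ends at the failover.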

Test 4: Split Brain

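What if the routers stop hearing each other but both are up?

# Case A: disconnect eth2 (sync interface)
# VRRP advertisements travel over eth0 and eth1, not eth2,
# so mastership does not change
# Conntrack sync stops, so a later failover drops established connections

# Case B: block VRRP advertisements between the routers on eth0/eth1
# (for example a switch partition or a multicast filter)
# Both routers think they're alone
# Both become master = split brain

# This is a failure mode to document, not prevent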

Test 5: Dual Failure

What if both routers fail?

This isn’t HA’s job. HA handles single failures. Document that dual failure = outage and set expectations accordingly.

VRRP Tuning

Advertisement Interval

Default is 1 second. Faster detection = faster failover, but more network traffic.

configure

set high-availability vrrp group LAN advertise-interval '1'

commit

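For sub-second failover, VRRPv3 allows advertisement intervals below one second, and some keepalived deployments use 0.1-0.5 seconds. The VyOS advertise-interval option is documented in whole seconds, so confirm what your release accepts before planning around sub-second timers. Short intervals are also more sensitive to network jitter.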
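
To see how the interval turns into failover time, here is a sketch of the timer a backup runs before it declares the master dead. The formula is the one in RFC 5798 (VRRPv3). At a 1-second interval the older RFC 3768 formula gives the same result. The function name master_down_ms is made up for illustration.

```shell
# Estimate how long a backup waits before declaring the master dead.
# Formula from RFC 5798 (VRRPv3):
#   skew_time            = (256 - priority) * advert_interval / 256
#   master_down_interval = 3 * advert_interval + skew_time
# Values are in milliseconds; integer division rounds down.

# master_down_ms <advert_interval_ms> <backup_priority>
master_down_ms() {
    advert_ms=$1
    priority=$2    # priority of the BACKUP router doing the waiting
    skew_ms=$(( (256 - priority) * advert_ms / 256 ))
    echo $(( 3 * advert_ms + skew_ms ))
}

# R2 from the example: priority 100, default 1-second interval
echo "R2 master-down timer: $(master_down_ms 1000 100) ms"   # 3609 ms
```

This covers the case where the master dies silently. A master that shuts down cleanly announces it, and the backup takes over sooner.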

Preempt Delay

When primary recovers, don’t immediately preempt. Let it stabilize:

configure

set high-availability vrrp group LAN preempt-delay '30'

commit

Primary must be up for 30 seconds before reclaiming master. Prevents flapping.

Health Check Scripts

Fail over based on more than interface status:

configure

set high-availability vrrp group LAN health-check script '/config/scripts/check-uplink.sh'
set high-availability vrrp group LAN health-check interval '5'
set high-availability vrrp group LAN health-check failure-count '3'

commit

Example script (/config/scripts/check-uplink.sh):

#!/bin/bash
# Check if upstream is reachable
ping -c 1 -W 1 198.51.100.1 > /dev/null 2>&1
exit $?

If the script returns non-zero 3 times in a row (about 15 seconds at a 5-second interval), VRRP fails over.
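
A single ping target has a weakness: if that one host goes down, both routers fail the check at the same time. A possible variant, sketched below, only reports failure when every target is unreachable. The function names (probe, any_reachable) and the second target 192.0.2.1 in the usage note are assumptions for illustration. Taking targets as arguments is also an assumption, so check whether your release accepts arguments on the health-check script.

```shell
#!/bin/bash
# Health check that fails only when ALL targets are unreachable.

# probe <ip>: one ICMP echo with a 1-second timeout
probe() {
    ping -c 1 -W 1 "$1" > /dev/null 2>&1
}

# any_reachable <ip>...: succeed as soon as one target answers
any_reachable() {
    for target in "$@"; do
        if probe "$target"; then
            return 0
        fi
    done
    return 1
}

# Run the check only when targets are given on the command line,
# so the file does nothing when sourced or run without arguments.
if [ "$#" -gt 0 ]; then
    any_reachable "$@"
    exit $?
fi
```

Invoked as check-uplink.sh 198.51.100.1 192.0.2.1, it exits 0 if either address answers. If arguments are not accepted, hardcoding the any_reachable call at the bottom of the script does the same job.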

Realistic HA Architecture

A production VyOS HA setup:

                    ┌─────────────────┐
                    │    Internet     │
                    └────────┬────────┘
                             │
              ┌──────────────┴──────────────┐
              │                             │
        ┌─────┴─────┐               ┌───────┴───────┐
        │  R1 (Pri) │───sync link───│  R2 (Backup)  │
        │ eth1: WAN │     eth2      │  eth1: WAN    │
        │ eth0: LAN │               │  eth0: LAN    │
        └─────┬─────┘               └───────┬───────┘
              │ VIP: 10.0.0.1               │
              └──────────────┬──────────────┘
                             │
                    ┌────────┴────────┐
                    │   LAN Switch    │
                    └────────┬────────┘
                             │
                    ┌────────┴────────┐
                    │     Clients     │
                    └─────────────────┘

Key points:

Dedicated sync link (eth2) — not through the LAN switch
Both routers connect to same LAN switch (single point of failure, but usually acceptable)
VIP is what clients use
If the LAN switch fails, both routers are useless anyway

The Lesson

HA is not a checkbox. It’s a set of failure scenarios and tests.

VRRP gives you IP failover in seconds. Conntrack sync gives you connection state (mostly). But:

BGP sessions reset
IPsec tunnels re-establish
Application state is lost
Brief packet loss happens

HA means: single router failure doesn’t cause outage. It doesn’t mean zero impact. Users may see a brief hiccup. Long connections survive but might stutter. This is acceptable for most use cases.

What makes HA work isn’t the configuration — it’s the testing. Every failure scenario you test is one you understand. Every one you skip is one that surprises you at 3 AM.

Document your failure modes. Test your failover. Know exactly what happens when the primary dies. That’s HA.