WireGuard Mesh at Scale: Routing, NAT Traversal, and Failover

Connecting two hosts with WireGuard takes ten minutes: generate keys, set an endpoint, define AllowedIPs, done. Connecting forty hosts in a full mesh is not “do that twenty times” — it is a different problem, where the static config that worked for a pair becomes an unmaintainable matrix and AllowedIPs quietly becomes your routing protocol.

WireGuard scales beautifully at the dataplane; the work is all in how you manage keys, addresses, and routes across many nodes.

AllowedIPs Is the Routing Table

The single most misunderstood field. On a peer entry, AllowedIPs does two jobs:

  • Inbound: which source addresses are permitted to arrive from this peer (cryptokey routing — packets from an unlisted source are dropped).
  • Outbound: which destination addresses are routed to this peer.

So in a mesh, each peer’s AllowedIPs is the set of networks reachable through it. Get it wrong and you have not “misconfigured a firewall rule” — you have a routing black hole, because WireGuard decides the next-hop peer purely from longest-prefix match across all peers’ AllowedIPs.

A node in a mesh, statically, looks like:

[Interface]
Address = 10.10.0.1/32
PrivateKey = <this node's private key>
ListenPort = 51820
[Peer] # node 2
PublicKey = <node2 pubkey>
Endpoint = node2.example.net:51820
AllowedIPs = 10.10.0.2/32, 10.20.2.0/24
PersistentKeepalive = 25
[Peer] # node 3
PublicKey = <node3 pubkey>
Endpoint = 198.51.100.3:51820
AllowedIPs = 10.10.0.3/32, 10.20.3.0/24
PersistentKeepalive = 25
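To make the longest-prefix rule concrete, here is a minimal sketch that imitates the outbound lookup for the two peers in the config above. It is a toy model in Python, not WireGuard's actual implementation, and the `PEERS` table and `pick_peer` helper are illustrative names.

```python
import ipaddress

# Peer table mirroring the static config above; the keys are illustrative labels.
PEERS = {
    "node2": ["10.10.0.2/32", "10.20.2.0/24"],
    "node3": ["10.10.0.3/32", "10.20.3.0/24"],
}

def pick_peer(dest, peers):
    """Return the peer whose AllowedIPs holds the longest prefix containing dest."""
    addr = ipaddress.ip_address(dest)
    best_name, best_len = None, -1
    for name, prefixes in peers.items():
        for prefix in prefixes:
            net = ipaddress.ip_network(prefix)
            if addr in net and net.prefixlen > best_len:
                best_name, best_len = name, net.prefixlen
    return best_name

print(pick_peer("10.20.3.5", PEERS))  # node3
print(pick_peer("10.20.9.9", PEERS))  # None: no peer claims it, so nothing carries it
```

The second lookup is the black hole described above: no peer lists a prefix covering the destination, so there is no tunnel to send it through.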

The n-Squared Problem

A full mesh of N nodes has N×(N−1)/2 tunnels and every node needs every other node’s public key and endpoint. Forty nodes is 780 tunnels and a config file per node listing 39 peers. Maintaining that by hand is hopeless — one rotated key means editing 39 files.

Two ways out:

  1. Generate the config. Treat the mesh as data — a list of nodes with keys, addresses, and endpoints — and template every node’s config from it. This is the pragmatic answer for static, known fleets. A small script (or Ansible/Nornir from your source of truth) renders all N configs from one inventory, and key rotation is a re-render.

  2. Hub-and-spoke, not full mesh. If most traffic is node-to-central rather than node-to-node, you do not need a full mesh. Spokes peer only with hubs; the hubs carry transit. Far fewer tunnels, far less config, at the cost of an extra hop for spoke-to-spoke.
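For the first option, a render step can be very small. The sketch below assumes a hypothetical inventory format (the field names, hostnames, and addresses are placeholders) and leaves the private key as a placeholder on the assumption that it is filled in on the node itself.

```python
# Hypothetical inventory: names, keys, endpoints, and networks are placeholders.
INVENTORY = [
    {"name": "node1", "pubkey": "<node1 pubkey>", "endpoint": "node1.example.net:51820",
     "address": "10.10.0.1/32", "networks": ["10.20.1.0/24"]},
    {"name": "node2", "pubkey": "<node2 pubkey>", "endpoint": "node2.example.net:51820",
     "address": "10.10.0.2/32", "networks": ["10.20.2.0/24"]},
    {"name": "node3", "pubkey": "<node3 pubkey>", "endpoint": None,  # e.g. behind NAT
     "address": "10.10.0.3/32", "networks": ["10.20.3.0/24"]},
]

def render_config(node, inventory):
    """Render one node's config with a [Peer] section for every other node."""
    lines = [
        "[Interface]",
        f"Address = {node['address']}",
        f"PrivateKey = <{node['name']} private key>",
        "ListenPort = 51820",
    ]
    for peer in inventory:
        if peer["name"] == node["name"]:
            continue
        allowed = [peer["address"]] + peer["networks"]
        lines += ["", f"[Peer] # {peer['name']}", f"PublicKey = {peer['pubkey']}"]
        if peer.get("endpoint"):
            lines.append(f"Endpoint = {peer['endpoint']}")
        lines += [f"AllowedIPs = {', '.join(allowed)}", "PersistentKeepalive = 25"]
    return "\n".join(lines) + "\n"

configs = {n["name"]: render_config(n, INVENTORY) for n in INVENTORY}
print(configs["node1"])
```

Rotating a key then means changing one `pubkey` value in the inventory and re-rendering every file.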

Most “I need a mesh” cases are actually hub-and-spoke with a handful of direct shortcuts. Build the topology the traffic needs, not the maximal one.
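The difference in tunnel count is easy to check. This sketch assumes a hub-and-spoke layout in which every spoke peers with every hub and the hubs are fully meshed among themselves; the two-hub figure is an illustrative choice.

```python
def full_mesh_tunnels(n):
    """Tunnels in a full mesh of n nodes: N x (N - 1) / 2."""
    return n * (n - 1) // 2

def hub_and_spoke_tunnels(n, hubs):
    """Each spoke peers with every hub; the hubs are meshed with each other."""
    spokes = n - hubs
    return spokes * hubs + full_mesh_tunnels(hubs)

print(full_mesh_tunnels(40))         # 780
print(hub_and_spoke_tunnels(40, 2))  # 77
```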

NAT Traversal With Keepalive

A node behind NAT has no stable inbound endpoint — the NAT mapping only exists while traffic flows. PersistentKeepalive = 25 makes the node send a tiny packet every 25 seconds, holding the NAT mapping open so peers can reach it. Without it, a NATed node is reachable only right after it initiates, then goes dark when the mapping times out.

Rule of thumb: any peer that sits behind NAT needs PersistentKeepalive. Peers with public, stable endpoints do not (but it does no harm). In a mesh where you do not know who is behind NAT, set it everywhere.

You only need Endpoint for peers you must initiate to. A NATed node can omit endpoints for its public peers and let keepalive + their reachability do the work — but at least one side of every tunnel needs a reachable endpoint, or neither can start the handshake.
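These rules can be checked mechanically per tunnel. The sketch below assumes a hypothetical `behind_nat` flag in the inventory and reports, for each pair, which side has to dial (and so needs an Endpoint for the other), which side needs PersistentKeepalive, and whether the tunnel can start at all.

```python
from itertools import combinations

# Hypothetical inventory; behind_nat is an assumed field, not a WireGuard setting.
NODES = {
    "hub-a": {"behind_nat": False},
    "branch-1": {"behind_nat": True},
    "branch-2": {"behind_nat": True},
}

def plan_tunnel(a, b, nodes):
    """Decide which side dials and which side needs keepalive for the a<->b tunnel."""
    natted = [n for n in (a, b) if nodes[n]["behind_nat"]]
    public = [n for n in (a, b) if not nodes[n]["behind_nat"]]
    if not public:
        dialers = []        # nobody has a reachable endpoint to dial
    elif natted:
        dialers = natted    # the NATed side must initiate toward the public side
    else:
        dialers = public    # both are reachable, so either side can initiate
    return {"viable": bool(public), "dialers": dialers, "keepalive": natted}

for a, b in combinations(NODES, 2):
    print(a, b, plan_tunnel(a, b, NODES))
```

The `branch-1` to `branch-2` pair comes back as not viable, which is the case where traffic has to transit a publicly reachable node instead.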

Routing Over the Mesh Instead of Static AllowedIPs

Static AllowedIPs is fine until a node’s reachable networks change, or you want failover between paths. At that point, stop encoding routes in AllowedIPs and run a real routing protocol over the tunnels.

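The pattern: set each peer’s AllowedIPs wide enough to permit the routing protocol and the networks it might advertise (often a summary, e.g. 10.0.0.0/8), then let FRR run OSPF or BGP across the WireGuard interface. Two constraints apply. A prefix can belong to only one peer per interface, so putting a wide summary on every tunnel generally means one WireGuard interface per peer. And OSPF hellos are sent to multicast (224.0.0.5), which must also fall inside AllowedIPs or they are dropped on the way out. If you bring the tunnels up with wg-quick, set Table = off so it does not install kernel routes for the wide AllowedIPs and the routing daemon owns the route table. With that in place, OSPF on one such tunnel (wg0 here) looks like: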

! FRR over wg0: OSPF discovers reachability dynamically
router ospf
 ! the mesh transit subnet
 network 10.10.0.0/24 area 0
 passive-interface default
 no passive-interface wg0

Now a node advertising a new subnet, or a path going down, is handled by OSPF reconvergence — not by editing AllowedIPs on every other node. This is the difference between a static VPN and a self-healing overlay. The catch: AllowedIPs must be permissive enough to carry the dynamic routes (cryptokey routing still gates what is allowed), so you trade tight per-route filtering for dynamic flexibility. For a trusted internal mesh that is the right trade.
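Because cryptokey routing still gates traffic, it is worth checking that the prefixes you expect the routing protocol to advertise actually fit inside a tunnel's AllowedIPs. A minimal sketch follows; the advertised prefixes are made-up examples and `uncovered` is an illustrative helper.

```python
import ipaddress

def uncovered(advertised, allowed_ips):
    """Return the advertised prefixes that no AllowedIPs entry contains."""
    allowed = [ipaddress.ip_network(p) for p in allowed_ips]
    missing = []
    for prefix in advertised:
        net = ipaddress.ip_network(prefix)
        # subnet_of() raises on mixed IPv4/IPv6, so compare versions first
        if not any(net.version == a.version and net.subnet_of(a) for a in allowed):
            missing.append(prefix)
    return missing

advertised = ["10.20.3.0/24", "10.99.0.0/16", "192.168.50.0/24"]
print(uncovered(advertised, ["10.0.0.0/8"]))  # ['192.168.50.0/24']
```

A route that shows up in the OSPF table but appears in this list would be accepted by the routing daemon and then dropped by WireGuard.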

Failover

With static config there is no failover — a peer is up or its routes are dead. Two ways to get resilience:

  • Routing protocol (above): multiple paths, reconvergence on failure. The clean answer.
  • Redundant peers to the same destination: give each spoke tunnels to two hubs, let OSPF cost pick the primary, and fail over to the secondary when the primary’s adjacency drops.

WireGuard itself has no concept of “tunnel down” beyond the handshake timing out; the routing layer on top is what turns a dead peer into a rerouted path.
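The two-hub case reduces to "lowest cost among the paths that are still up". The sketch below is a toy model of that decision, not how FRR implements it; the hub names, costs, and the `adjacency_up` flag are placeholders.

```python
# Hypothetical view of one spoke's two uplinks; costs and names are placeholders.
UPLINKS = [
    {"hub": "hub-a", "cost": 10, "adjacency_up": True},
    {"hub": "hub-b", "cost": 20, "adjacency_up": True},
]

def active_hub(uplinks):
    """Lowest-cost hub whose adjacency is up, or None if every path is dead."""
    live = [u for u in uplinks if u["adjacency_up"]]
    return min(live, key=lambda u: u["cost"])["hub"] if live else None

print(active_hub(UPLINKS))          # hub-a
UPLINKS[0]["adjacency_up"] = False  # the primary's adjacency drops
print(active_hub(UPLINKS))          # hub-b
```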

Verifying

# Per-peer state: last handshake, transfer, endpoint
wg show
# A peer with no recent handshake is unreachable — check endpoint/keepalive/NAT
wg show wg0 latest-handshakes
# Does the kernel send the destination out the expected interface? (AllowedIPs then picks the peer)
ip route get 10.20.3.5
# Routing protocol adjacencies over the mesh
vtysh -c "show ip ospf neighbor"

The output of wg show wg0 latest-handshakes is the first thing to read. A handshake older than a couple of minutes on a keepalive peer means the tunnel is effectively down, usually because a NAT mapping expired (missing keepalive) or an endpoint changed.
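That check is easy to script. `wg show wg0 latest-handshakes` prints one line per peer with the public key and the Unix time of the last handshake (0 if there has never been one). The sketch below parses that output; the 180-second threshold is an assumed stand-in for "a couple of minutes", and the sample keys are placeholders.

```python
def stale_peers(output, now, max_age=180):
    """Return public keys whose last handshake is missing or older than max_age seconds."""
    stale = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        pubkey, last = fields[0], int(fields[1])
        if last == 0 or now - last > max_age:
            stale.append(pubkey)
    return stale

# Sample output with placeholder keys; real keys are base64 strings.
SAMPLE = "peerA-pubkey\t1700000100\npeerB-pubkey\t1699999000\npeerC-pubkey\t0\n"
print(stale_peers(SAMPLE, now=1700000150))  # ['peerB-pubkey', 'peerC-pubkey']
```

On a live node, feed it the command's stdout (for example via `subprocess.run`) together with `int(time.time())` for `now`.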

The AllowedIPs Overlap Trap

The longest-prefix-match behavior bites hardest when two peers claim overlapping ranges. WireGuard does not warn you — it silently routes a destination to whichever peer has the most specific match, and a tie or an unintended overlap sends traffic to the wrong tunnel or none at all.

Say two peers both list a summary route:

[Peer] # hub-a
PublicKey = <hub-a pubkey>
AllowedIPs = 10.20.0.0/16
[Peer] # hub-b — overlaps hub-a
PublicKey = <hub-b pubkey>
AllowedIPs = 10.20.0.0/16

WireGuard cannot install the same prefix toward two peers on one interface — the second wg set wins and the first peer silently loses that route. You will see it as “traffic to 10.20.5.5 only works sometimes” depending on config order. The fix is either non-overlapping prefixes per peer, or — the real answer at scale — stop putting summaries in AllowedIPs at all and let the routing protocol own reachability, keeping AllowedIPs as a permissive crypto gate only.
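A render step can catch this before anything is deployed. The sketch below compares every pair of prefixes across peers and separates exact duplicates (one peer silently loses the route) from nested prefixes (the more specific one wins, which may or may not be intended). The peer table is a made-up example.

```python
import ipaddress
from itertools import combinations

def find_overlaps(peers):
    """Return (peer1, prefix1, peer2, prefix2, kind) for prefixes shared across peers."""
    entries = [(name, ipaddress.ip_network(p))
               for name, prefixes in peers.items() for p in prefixes]
    hits = []
    for (n1, net1), (n2, net2) in combinations(entries, 2):
        if n1 != n2 and net1.version == net2.version and net1.overlaps(net2):
            kind = "duplicate" if net1 == net2 else "nested"
            hits.append((n1, str(net1), n2, str(net2), kind))
    return hits

# Made-up peer table reproducing the hub-a / hub-b overlap above.
PEERS = {
    "hub-a": ["10.20.0.0/16"],
    "hub-b": ["10.20.0.0/16"],
    "node3": ["10.20.3.0/24"],
}
for hit in find_overlaps(PEERS):
    print(hit)
```

Treating "duplicate" as a hard error and "nested" as a warning is one reasonable policy for a config generator.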

Confirm which peer actually owns a destination before debugging anything else:

# Which peer's AllowedIPs matched? wg show maps the route to a public key.
wg show wg0 allowed-ips
# Then confirm the kernel agrees on the next-hop interface
ip route get 10.20.5.5

Key Rotation Without Dropping the Mesh

Rotating a node’s key is where the generate-from-source-of-truth approach pays off, but the order matters — change the key in the wrong sequence and you cut the node off mid-rotation. WireGuard has no key-rollover handshake; a new private key means every peer must learn the new public key.

The safe sequence, driven from your inventory render:

# 1. Generate the new keypair on the rotating node; umask keeps the private key unreadable to others
umask 077
wg genkey | tee /etc/wireguard/node7.key | wg pubkey > node7.pub
# 2. Push the NEW public key to every PEER first (peers tolerate a peer
# entry that has not handshaken yet; the node just hasn't switched)
# Re-render and apply peer configs across the fleet.
# 3. Only then switch the node itself to the new private key, and update
# PrivateKey in its config file so the change survives a restart
wg set wg0 private-key /etc/wireguard/node7.key
# 4. Remove the OLD public key from peers on the next render

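Pushing the new public key to peers before the node switches means the handshake succeeds as soon as the node flips, with no wait for the fleet to learn the new key. One caveat: an AllowedIPs prefix can be bound to only one peer entry per interface, so once a peer’s new entry claims the node’s prefixes, the old-key entry stops passing traffic. Run steps 2 and 3 back to back to keep that gap short. Then verify the rotated node re-handshakes everywhere: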

wg show wg0 latest-handshakes
# Every peer should show a handshake within the last ~2 minutes post-rotation

If you run a routing protocol over the mesh, the adjacency re-forms on its own once the tunnel handshakes; with static AllowedIPs nothing else needs touching since the key change does not alter the route table.
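The ordering can be captured as data so that whatever applies it cannot get the sequence wrong. This is a sketch under assumptions: the host names, the 10.20.7.0/24 prefix, and the `rotation_plan` helper are illustrative, and a real render would also carry over Endpoint and PersistentKeepalive for the new peer entry. The `wg set` commands change runtime state only, so the rendered config files still need to match.

```python
def rotation_plan(node, peers, old_pub, new_pub, allowed_ips, key_path, iface="wg0"):
    """Return (host, command) pairs in the order they should run."""
    # New public key goes to every peer first
    plan = [(p, f"wg set {iface} peer {new_pub} allowed-ips {allowed_ips}") for p in peers]
    # Only then does the node itself switch to the new private key
    plan.append((node, f"wg set {iface} private-key {key_path}"))
    # Finally the old public key is removed from every peer
    plan += [(p, f"wg set {iface} peer {old_pub} remove") for p in peers]
    return plan

plan = rotation_plan(
    node="node7",
    peers=["node1", "node2"],
    old_pub="<node7 old pubkey>",
    new_pub="<node7 new pubkey>",
    allowed_ips="10.10.0.7/32,10.20.7.0/24",  # assumed prefixes for node7
    key_path="/etc/wireguard/node7.key",
)
for host, cmd in plan:
    print(f"{host}: {cmd}")
```

Because assigning the node's prefixes to the new entry takes them away from the old one, the first two phases are best run close together.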

The Takeaway

The dataplane is the easy part — WireGuard is fast and the crypto is not your problem. The engineering is in treating the mesh as managed infrastructure: generate configs from a source of truth so key rotation is a re-render, use keepalives wherever NAT is in play, and run a routing protocol over the tunnels once AllowedIPs stops being something you can maintain by hand. Build the topology your traffic actually needs, and let routing — not a hand-edited matrix of peers — decide the paths.