NX-OS VXLAN/EVPN Fabric: Underlay and Overlay End to End

The spine/leaf cabling is done and the underlay pings. Now the actual job starts: turning a routed fabric into something that carries tenant L2 and L3 across racks. VXLAN with BGP EVPN is the standard answer, but the config touches five features that all have to agree, and a single mismatch leaves you with a fabric that looks healthy and forwards nothing.

This is the order I build it in, and the checks I run at each layer before moving up.

The Two Planes

VXLAN separates the transport from the service:

Overlay  (EVPN)   tenant MAC/IP reachability via BGP
───────────────────────────────────────────────
Underlay (OSPF)   loopback reachability + ECMP

The underlay only needs to do one thing well: every VTEP loopback reachable from every other, with equal-cost paths to the spines. The overlay rides on top and never appears in the underlay routing table.

Underlay: OSPF and PIM

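Enable features first. On NX-OS nothing works until the feature is on. PIM is only needed if you plan to use multicast replication; the build below uses ingress replication, so feature pim and the ip pim sparse-mode lines are optional.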

Terminal window
feature ospf
feature pim
feature interface-vlan
feature vn-segment-vlan-based
feature nv overlay
nv overlay evpn
feature bgp
feature fabric forwarding

Point-to-point links to the spines, plus loopbacks for the BGP router-id and the VTEP source:

Terminal window
interface loopback0
  description ROUTER-ID
  ip address 10.255.0.11/32
  ip router ospf UNDERLAY area 0.0.0.0
  ip pim sparse-mode
interface loopback1
  description VTEP-NVE-SOURCE
  ip address 10.255.1.11/32
  ip router ospf UNDERLAY area 0.0.0.0
  ip pim sparse-mode
interface Ethernet1/1
  description TO-SPINE-1
  no switchport
  ip address 10.1.1.1/31
  ip router ospf UNDERLAY area 0.0.0.0
  ip pim sparse-mode
  mtu 9216
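
The interface lines reference an OSPF process named UNDERLAY that is not shown. A minimal sketch of what it might look like, assuming the OSPF router-id reuses the loopback0 address and that the /31 fabric links should run as OSPF point-to-point (both are my assumptions, not part of the config above):

```text
router ospf UNDERLAY
  router-id 10.255.0.11
interface Ethernet1/1
  ! assumption: skip DR/BDR election on a two-router link
  ip ospf network point-to-point
```

Point-to-point network type is a common choice on spine/leaf links because there is never a third router on the segment, so the DR election only adds delay to adjacency formation.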

Two details that cause silent failures:

  • MTU. VXLAN adds 50 bytes. If the underlay is 1500, every full-size tenant frame drops. Set 9216 fabric-wide.
  • Separate loopbacks. Keep router-id (lo0) and NVE source (lo1) distinct. With vPC, both leaves share an anycast lo1 address, and you do not want that bleeding into your BGP router-id.
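
The 50 bytes are outer Ethernet (14) + outer IPv4 (20) + UDP (8) + VXLAN header (8). A 1514-byte tenant frame therefore leaves the VTEP as a 1564-byte frame carrying a 1550-byte outer IP packet, which a 1500-byte underlay link cannot pass. One way to prove the jumbo path end to end is a don't-fragment ping between VTEP loopbacks. The 8972-byte size here is an arbitrary pick that sits well above 1500 and below 9216, not a value from this build:

```text
! run on one leaf, toward a remote VTEP loopback
ping 10.255.1.12 source 10.255.1.11 packet-size 8972 df-bit count 5
```

If this fails while a default-size ping succeeds, some link in the path is still at a smaller MTU.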

Verify before going further:

Terminal window
show ip ospf neighbors
show ip route 10.255.1.12 # remote VTEP loopback, must be /32 via spines
show ip pim rp # if using multicast replication

If a remote VTEP loopback is not in the table with ECMP next-hops, stop. The overlay cannot work.

Overlay: BGP EVPN

Spines are route reflectors; leaves are clients. The address family that matters is l2vpn evpn.

Terminal window
router bgp 65000
  router-id 10.255.0.11
  address-family l2vpn evpn
    retain route-target all
  neighbor 10.255.0.1
    remote-as 65000
    update-source loopback0
    address-family l2vpn evpn
      send-community extended
  neighbor 10.255.0.2
    remote-as 65000
    update-source loopback0
    address-family l2vpn evpn
      send-community extended

send-community extended is not optional — EVPN route-targets travel as extended communities. Drop it and routes propagate but import nowhere.
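
The leaf side above is only half of the session. A sketch of the matching spine side, assuming spine 1 uses 10.255.0.1 as its router-id (inferred from the leaf's neighbor statement) and that each leaf is added as a client in the same way:

```text
router bgp 65000
  router-id 10.255.0.1
  neighbor 10.255.0.11
    remote-as 65000
    update-source loopback0
    address-family l2vpn evpn
      send-community extended
      route-reflector-client
```

Without route-reflector-client on the spine, iBGP will not pass one leaf's EVPN routes on to the other leaves, and every session still shows Established.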

Confirm the sessions came up in the right AFI:

Terminal window
show bgp l2vpn evpn summary
# State/PfxRcd should show a number, not Idle/Active

Mapping VLANs to VXLAN

Each tenant VLAN gets a VNI. L2 VNIs carry MAC; an L3 VNI per VRF carries inter-subnet routing.

Terminal window
vlan 10
  vn-segment 10010
vlan 20
  vn-segment 10020
! L3 VNI for tenant VRF
vlan 999
  vn-segment 50999
vrf context TENANT-A
  vni 50999
  rd auto
  address-family ipv4 unicast
    route-target both auto evpn

rd auto and route-target both auto evpn let NX-OS derive the values for you: the RT from the BGP AS and the VNI, the RD from the BGP router-id. It is consistent and saves a class of typo bugs, so use it unless you have a specific reason to pin values.

The NVE Interface

This is the encapsulation engine. Ingress replication via BGP avoids needing multicast in the underlay, which is the simpler choice for most fabrics:

Terminal window
interface nve1
  no shutdown
  host-reachability protocol bgp
  source-interface loopback1
  member vni 10010
    ingress-replication protocol bgp
  member vni 10020
    ingress-replication protocol bgp
  member vni 50999 associate-vrf

associate-vrf on the L3 VNI is what makes inter-subnet routing work across the fabric. Forget it and L2 stretches fine but tenants cannot route between subnets.
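
One related piece the config so far does not show: with a VLAN-based L3 VNI, the transit VLAN normally also needs an SVI in the tenant VRF before the L3 VNI comes up. A minimal sketch, assuming VLAN 999 from the mapping above; check it against your platform and release:

```text
interface Vlan999
  description L3VNI-TENANT-A
  no shutdown
  vrf member TENANT-A
  ! no IP address here; ip forward lets the SVI route without one
  ip forward
```

show nve vni should then list VNI 50999 as an L3 VNI for TENANT-A in the Up state.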

Tie the VLANs into EVPN:

Terminal window
evpn
  vni 10010 l2
    rd auto
    route-target import auto
    route-target export auto
  vni 10020 l2
    rd auto
    route-target import auto
    route-target export auto
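
For reference, a sketch of the pinned values that auto should resolve to on this leaf for VNI 10010. It assumes the derivation I understand NX-OS to use (RT = ASN:VNI, RD = router-id:(32767 + VLAN ID)), so confirm it against the live output before copying it anywhere:

```text
evpn
  vni 10010 l2
    ! assumed auto RD: router-id 10.255.0.11, 32767 + VLAN 10
    rd 10.255.0.11:32777
    ! assumed auto RT: AS 65000, VNI 10010
    route-target import 65000:10010
    route-target export 65000:10010
```

show bgp l2vpn evpn vni-id 10010 prints the RD actually in use, which is the quickest way to check the assumption.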

Distributed Anycast Gateway

Every leaf is the default gateway for every subnet, using the same MAC everywhere. A VM keeps its gateway after a live migration to another rack — no ARP relearning, no traffic tromboning to one switch.

Terminal window
fabric forwarding anycast-gateway-mac 0000.2222.3333
interface Vlan10
  no shutdown
  vrf member TENANT-A
  ip address 10.10.10.1/24
  fabric forwarding mode anycast-gateway
interface Vlan20
  no shutdown
  vrf member TENANT-A
  ip address 10.10.20.1/24
  fabric forwarding mode anycast-gateway

The same Vlan10 SVI with the same IP is configured on every leaf. That is intentional. The anycast-gateway-mac must be identical fabric-wide.

Proving It Forwards

Control plane up is not the same as data plane working. Walk it end to end.

Terminal window
# Are remote VTEPs discovered as NVE peers?
show nve peers
# Is the local VTEP up with the right source?
show nve interface nve1 detail
# Type-2 routes (MAC/IP) learned from other leaves?
show bgp l2vpn evpn
# A specific host's MAC reachable, and via which VTEP?
show l2route evpn mac all
show l2route evpn mac-ip all
# VXLAN-aware MAC table: remote MACs list nve1(<peer-ip>) in the Ports column
show mac address-table dynamic vlan 10

The single most useful check is show nve peers. If two leaves host the same VNI but do not appear as peers to each other, ingress replication has nothing to replicate to, and east-west traffic between them blackholes while local traffic looks fine.
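
When a peer is missing, these follow-up checks usually narrow it down. With ingress-replication protocol bgp, the replication list is built from EVPN Type-3 (inclusive multicast) routes, so a VNI with no Type-3 route from the other leaf has nowhere to flood BUM traffic. A sketch of the sequence, with nothing specific to this fabric assumed beyond the VNIs already configured:

```text
! is the VNI itself Up on this leaf, and in the expected mode?
show nve vni
! did the other leaf advertise a Type-3 route for the VNI?
show bgp l2vpn evpn route-type 3
! which remote VTEPs ended up in the replication list?
show nve vni ingress-replication
```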

For L3, confirm the tenant VRF sees remote subnets via the L3 VNI:

Terminal window
show ip route vrf TENANT-A
# remote routes show the remote VTEP as next-hop (in %default), with the L3 VNI as segid and encap VXLAN
show forwarding vrf TENANT-A route

Failure Drills Worth Running Pre-Production

Test                       | Expected                    | Common bug it catches
Shut one spine uplink      | ECMP reroutes, no loss      | Missing ip router ospf on a link
Reload a leaf              | Hosts reconverge in seconds | NVE source on lo0 instead of lo1
Move a host between leaves | MAC mobility, no duplicate  | Missing anycast-gateway-mac match
Ping across subnets        | Routed via L3 VNI           | associate-vrf forgotten

Run these with traffic flowing, not on a quiet fabric. The failures that matter only show up under load.

The Bug That Looks Like a Hardware Fault

The most time-consuming VXLAN failure I have chased was a fabric where two leaves hosted the same VNI, OSPF was full, BGP EVPN was up, Type-2 routes were present on both sides — and a host on Leaf 11 could not reach a host on Leaf 12. Local traffic worked. Everything looked healthy.

show nve peers is where it surfaced:

Terminal window
show nve peers
# Interface Peer-IP State LearnType Uptime Router-Mac
# nve1 10.255.1.12 Up CP 01:14:22 n/a

The peer was there, control-plane learned. But the MAC table told a different story:

Terminal window
show mac address-table dynamic vlan 10
# the remote MAC pointed at peer-ip 10.255.1.99, not at Leaf 12's assigned 10.255.1.12

The cause was a duplicated NVE source loopback. Leaf 12 had been templated from Leaf 99 and its interface loopback1 still carried 10.255.1.99/32, the same address Leaf 99 was using. OSPF advertised it from both leaves, BGP next-hops resolved to it, and frames for Leaf 12's hosts were VXLAN-encapsulated toward an address the underlay could just as easily deliver to Leaf 99. The fix was one line, but the symptom (partial reachability with a clean control plane) screams hardware until you check the source addresses.

The lesson: when east-west blackholes but the control plane is green, audit show nve interface nve1 detail for the source IP on every leaf and confirm each one is unique. A duplicated VTEP source is invisible to every protocol-level check.
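
A sketch of that audit. The 10.255.1.0/24 range is my assumption, based on the VTEP loopback addresses used throughout this build:

```text
! on every leaf: the NVE source address actually in use
show nve interface nve1 detail | include Source
show running-config interface loopback1
! on a spine: every VTEP /32 in the assumed range, with its next-hops
show ip route 10.255.1.0/24 longer-prefixes
```

From a spine, a unique VTEP loopback should resolve to exactly one leaf-facing next-hop. A /32 reachable via two different leaves is either a vPC anycast address or a duplicate.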

vPC and the Anycast VTEP

Two leaves running vPC must present a single VTEP to the fabric, or MAC moves between the vPC peers look like host flaps to the rest of the EVPN domain. The mechanism is a shared secondary IP on the NVE source loopback — the anycast VTEP address — advertised by both peers.

Terminal window
interface loopback1
  description VTEP-NVE-SOURCE
  ! unique primary
  ip address 10.255.1.11/32
  ! shared anycast VTEP, identical on the vPC peer
  ip address 10.255.1.100/32 secondary
  ip router ospf UNDERLAY area 0.0.0.0
  ip pim sparse-mode
interface nve1
  source-interface loopback1
  source-interface hold-down-time 180

Remote leaves see one VTEP (10.255.1.100) regardless of which vPC peer learned the MAC, so a host hashed to either peer is just “behind the anycast VTEP.” The source-interface hold-down-time keeps the NVE down after reload until the underlay and vPC are settled — bring the VTEP up too early and you advertise reachability before forwarding is ready, blackholing traffic for the few seconds it takes BGP to converge.

Verify both peers advertise the same secondary and that orphan hosts (single-homed to one peer) are reachable:

Terminal window
show nve interface nve1 detail | include "Source\|Anycast"
show vpc # both peers vPC-consistent
show ip route 10.255.1.100/32 # on a spine: ECMP next-hops toward both vPC peers

What I Skip and Why

I do not start with multicast underlay replication. Ingress replication via BGP is one less protocol to operate, and for fabrics up to a few dozen leaves the replication cost is negligible. Multicast earns its place only at scale, and by then you have the EVPN fundamentals solid.

Get the underlay boringly reliable, bring up EVPN, then add VNIs one tenant at a time with show nve peers open in another window. A fabric built this way fails predictably, which is the only kind of failure you want.