The spine/leaf cabling is done and the underlay pings. Now the actual job starts: turning a routed fabric into something that carries tenant L2 and L3 across racks. VXLAN with BGP EVPN is the standard answer, but the config touches five features that all have to agree, and a single mismatch leaves you with a fabric that looks healthy and forwards nothing.
This is the order I build it in, and the checks I run at each layer before moving up.
The Two Planes
VXLAN separates the transport from the service:

    Overlay  (EVPN)   tenant MAC/IP reachability via BGP
    ───────────────────────────────────────────────
    Underlay (OSPF)   loopback reachability + ECMP

The underlay only needs to do one thing well: every VTEP loopback reachable from every other, with equal-cost paths to the spines. The overlay rides on top and never appears in the underlay routing table.
Underlay: OSPF and PIM
Enable features first. On NX-OS nothing works until the feature is on.

    feature ospf
    feature pim
    feature interface-vlan
    feature vn-segment-vlan-based
    feature nv overlay
    nv overlay evpn
    feature bgp

Point-to-point links to the spines, plus loopbacks for the BGP router-id and the VTEP source:

    interface loopback0
      description ROUTER-ID
      ip address 10.255.0.11/32
      ip router ospf UNDERLAY area 0.0.0.0
      ip pim sparse-mode

    interface loopback1
      description VTEP-NVE-SOURCE
      ip address 10.255.1.11/32
      ip router ospf UNDERLAY area 0.0.0.0
      ip pim sparse-mode

    interface Ethernet1/1
      description TO-SPINE-1
      no switchport
      ip address 10.1.1.1/31
      ip router ospf UNDERLAY area 0.0.0.0
      ip pim sparse-mode
      mtu 9216

Two details that cause silent failures:
- MTU. VXLAN adds 50 bytes. If the underlay is 1500, every full-size tenant frame drops. Set 9216 fabric-wide.
- Separate loopbacks. Keep router-id (lo0) and NVE source (lo1) distinct. With vPC, both leaves share an anycast lo1 address, and you do not want that bleeding into your BGP router-id.
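The 50-byte figure is just the sum of the outer headers wrapped around each tenant frame. A quick sketch of that arithmetic, assuming an IPv4 underlay and an untagged outer frame (the constant and function names are mine, purely for illustration):

```python
# Headers VXLAN adds around each tenant frame, in bytes.
# Assumes an IPv4 underlay and no 802.1Q tag on the outer frame.
OUTER_ETHERNET = 14
OUTER_IPV4 = 20
OUTER_UDP = 8
VXLAN_HEADER = 8
VXLAN_OVERHEAD = OUTER_ETHERNET + OUTER_IPV4 + OUTER_UDP + VXLAN_HEADER


def required_underlay_mtu(tenant_mtu: int) -> int:
    """Underlay MTU needed to carry a full-size tenant packet without fragmentation."""
    return tenant_mtu + VXLAN_OVERHEAD


print(VXLAN_OVERHEAD)               # 50
print(required_underlay_mtu(1500))  # 1550
print(required_underlay_mtu(9000))  # 9050
```

Even jumbo tenants at 9000 land well under 9216, which is why one fabric-wide value is enough.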
Verify before going further:

    show ip ospf neighbors
    show ip route 10.255.1.12    # remote VTEP loopback, must be /32 via spines
    show ip pim rp mapping       # if using multicast replication

If a remote VTEP loopback is not in the table with ECMP next-hops, stop. The overlay cannot work.
Overlay: BGP EVPN
Spines are route reflectors; leaves are clients. The address family that matters is l2vpn evpn.

    router bgp 65000
      router-id 10.255.0.11
      address-family l2vpn evpn
        retain route-target all
      neighbor 10.255.0.1
        remote-as 65000
        update-source loopback0
        address-family l2vpn evpn
          send-community extended
      neighbor 10.255.0.2
        remote-as 65000
        update-source loopback0
        address-family l2vpn evpn
          send-community extended

send-community extended is not optional: EVPN route-targets travel as extended communities. Drop it and routes propagate but import nowhere.
Confirm the sessions came up in the right AFI:

    show bgp l2vpn evpn summary
    # State/PfxRcd should show a number, not Idle/Active

Mapping VLANs to VXLAN
Each tenant VLAN gets a VNI. L2 VNIs carry MAC; an L3 VNI per VRF carries inter-subnet routing.
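
    vlan 10
      vn-segment 10010
    vlan 20
      vn-segment 10020
    vlan 999
      vn-segment 50999    # L3 VNI for tenant VRF

    vrf context TENANT-A
      vni 50999
      rd auto
      address-family ipv4 unicast
        route-target both auto evpn

    interface Vlan999     # SVI for the L3 VNI; carries no IP address
      no shutdown
      vrf member TENANT-A
      ip forward

The L3 VNI VLAN needs its own SVI in the tenant VRF with ip forward; without it the L3 VNI does not come up.

rd auto and route-target both auto evpn let NX-OS derive the RD from the router-id and the RT from the BGP AS and VNI. It is consistent and saves a class of typo bugs. Use it unless you have a specific reason to pin values.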
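To make the auto route-target concrete, here is a small sketch of the ASN:VNI form that NX-OS documents for 2-byte AS numbers. The function name is hypothetical, and 4-byte ASNs are deliberately rejected because they follow a different rule that this sketch does not attempt to model.

```python
def auto_route_target(asn: int, vni: int) -> str:
    """Return the ASN:VNI route-target string for a 2-byte ASN (illustrative only)."""
    if not 1 <= asn <= 65535:
        raise ValueError("sketch covers 2-byte ASNs only")
    if not 1 <= vni <= 16777215:  # the VNI is a 24-bit field
        raise ValueError("VNI out of range")
    return f"{asn}:{vni}"


for vni in (10010, 10020, 50999):
    print(auto_route_target(65000, vni))
```

Every leaf in AS 65000 arrives at the same value for a given VNI, which is why import and export line up without anyone coordinating numbers by hand.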
The NVE Interface
This is the encapsulation engine. Ingress replication via BGP avoids needing multicast in the underlay, which is the simpler choice for most fabrics:

    interface nve1
      no shutdown
      host-reachability protocol bgp
      source-interface loopback1
      member vni 10010
        ingress-replication protocol bgp
      member vni 10020
        ingress-replication protocol bgp
      member vni 50999 associate-vrf

associate-vrf on the L3 VNI is what makes inter-subnet routing work across the fabric. Forget it and L2 stretches fine but tenants cannot route between subnets.
Tie the VLANs into EVPN:

    evpn
      vni 10010 l2
        rd auto
        route-target import auto
        route-target export auto
      vni 10020 l2
        rd auto
        route-target import auto
        route-target export auto

Distributed Anycast Gateway
Every leaf is the default gateway for every subnet, using the same MAC everywhere. A VM keeps its gateway after a live migration to another rack — no ARP relearning, no traffic tromboning to one switch.

    fabric forwarding anycast-gateway-mac 0000.2222.3333

    interface Vlan10
      no shutdown
      vrf member TENANT-A
      ip address 10.10.10.1/24
      fabric forwarding mode anycast-gateway

    interface Vlan20
      no shutdown
      vrf member TENANT-A
      ip address 10.10.20.1/24
      fabric forwarding mode anycast-gateway

The same Vlan10 SVI with the same IP is configured on every leaf. That is intentional. The anycast-gateway-mac must be identical fabric-wide.
Proving It Forwards
Control plane up is not the same as data plane working. Walk it end to end.

    # Are remote VTEPs discovered as NVE peers?
    show nve peers

    # Is the local VTEP up with the right source?
    show nve interface nve1 detail

    # Type-2 routes (MAC/IP) learned from other leaves?
    show bgp l2vpn evpn

    # A specific host's MAC reachable, and via which VTEP?
    show l2route evpn mac all
    show l2route evpn mac-ip all

    # VXLAN-aware MAC table: note the remote VTEP in the "next-hop"
    show mac address-table dynamic vlan 10

The single most useful check is show nve peers. If two leaves host the same VNI but do not appear as peers to each other, ingress replication has nothing to replicate to, and east-west traffic between them blackholes while local traffic looks fine.
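The peer check is easy to turn into a comparison between intent and reality. A minimal sketch, assuming you type in by hand which VTEPs should host each VNI and which peer IPs show nve peers actually listed; the function name and the sample data are hypothetical, and nothing here parses CLI output:

```python
def missing_nve_peers(local_vtep, vni_members, observed_peers):
    """Return VTEPs that share a VNI with local_vtep but are absent from observed_peers."""
    expected = set()
    for vteps in vni_members.values():
        if local_vtep in vteps:
            expected.update(vteps)
    expected.discard(local_vtep)  # a VTEP never peers with itself
    return sorted(expected - set(observed_peers))


# Intended VNI placement (hypothetical) and the peers seen on 10.255.1.11.
vni_members = {
    10010: {"10.255.1.11", "10.255.1.12"},
    10020: {"10.255.1.11", "10.255.1.13"},
}
observed = ["10.255.1.12"]

print(missing_nve_peers("10.255.1.11", vni_members, observed))
```

Anything the function returns is a VTEP that should be receiving replicated traffic from this leaf and is not.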
For L3, confirm the tenant VRF sees remote subnets via the L3 VNI:

    show ip route vrf TENANT-A
    # remote subnets show next-hop of the remote VTEP, %default (the underlay VRF), via the L3 VNI
    show forwarding vrf TENANT-A route

Failure Drills Worth Running Pre-Production
| Test | Expected | Common bug it catches |
|---|---|---|
| Shut one spine uplink | ECMP reroutes, no loss | Missing ip router ospf on a link |
| Reload a leaf | Hosts reconverge in seconds | NVE source on lo0 instead of lo1 |
| Move a host between leaves | MAC mobility, no duplicate | anycast-gateway-mac mismatch between leaves |
| Ping across subnets | Routed via L3 VNI | associate-vrf forgotten |
Run these with traffic flowing, not on a quiet fabric. The failures that matter only show up under load.
The Bug That Looks Like a Hardware Fault
The most time-consuming VXLAN failure I have chased was a fabric where two leaves hosted the same VNI, OSPF was full, BGP EVPN was up, Type-2 routes were present on both sides — and a host on Leaf 11 could not reach a host on Leaf 12. Local traffic worked. Everything looked healthy.
show nve peers is where it surfaced:

    show nve peers
    # Interface  Peer-IP       State  LearnType  Uptime    Router-Mac
    # nve1       10.255.1.12   Up     CP         01:14:22  n/a

The peer was there, control-plane learned. But the MAC table told a different story:

    show mac address-table dynamic vlan 10
    # the remote MAC pointed at peer-ip 10.255.1.99, an address no leaf owned

The cause was a duplicated NVE source loopback. Leaf 12 had been templated from Leaf 99 and its interface loopback1 still carried 10.255.1.99/32. OSPF advertised it, BGP next-hops resolved to it, and frames were VXLAN-encapsulated toward a destination that existed nowhere. The fix was one line, but the symptom (partial reachability with a clean control plane) screams hardware until you check the source addresses.
The lesson: when east-west blackholes but the control plane is green, audit show nve interface nve1 detail for the source IP on every leaf and confirm each one is unique. A duplicated VTEP source is invisible to every protocol-level check.
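That audit is a few lines once the source IPs are collected. A sketch, assuming you have gathered each leaf's primary NVE source address by hand from show nve interface nve1 detail; the hostnames and the function name are hypothetical:

```python
from collections import defaultdict


def duplicated_vtep_sources(sources):
    """Map each primary NVE source IP used by more than one leaf to those leaves."""
    by_ip = defaultdict(list)
    for leaf, ip in sources.items():
        by_ip[ip].append(leaf)
    return {ip: sorted(leaves) for ip, leaves in by_ip.items() if len(leaves) > 1}


sources = {
    "leaf11": "10.255.1.11",
    "leaf12": "10.255.1.99",  # stale address left over from the template
    "leaf99": "10.255.1.99",
}
print(duplicated_vtep_sources(sources))  # {'10.255.1.99': ['leaf12', 'leaf99']}
```

Feed it primary addresses only. A shared secondary on a vPC pair is intentional and would show up here as a false positive.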
vPC and the Anycast VTEP
Two leaves running vPC must present a single VTEP to the fabric, or MAC moves between the vPC peers look like host flaps to the rest of the EVPN domain. The mechanism is a shared secondary IP on the NVE source loopback — the anycast VTEP address — advertised by both peers.

    interface loopback1
      description VTEP-NVE-SOURCE
      ip address 10.255.1.11/32               # unique primary
      ip address 10.255.1.100/32 secondary    # shared anycast VTEP, identical on the vPC peer
      ip router ospf UNDERLAY area 0.0.0.0
      ip pim sparse-mode

    interface nve1
      source-interface loopback1
      source-interface hold-down-time 180

Remote leaves see one VTEP (10.255.1.100) regardless of which vPC peer learned the MAC, so a host hashed to either peer is just “behind the anycast VTEP.” The source-interface hold-down-time keeps the NVE down after reload until the underlay and vPC are settled. Bring the VTEP up too early and you advertise reachability before forwarding is ready, blackholing traffic for the few seconds it takes BGP to converge.
Verify both peers advertise the same secondary and that orphan hosts (single-homed to one peer) are reachable:

    show nve interface nve1 detail | include "Source\|Anycast"
    show vpc                            # both peers vPC-consistent
    show bgp l2vpn evpn 10.255.1.100    # both leaves originate the anycast VTEP route

What I Skip and Why
I do not start with multicast underlay replication. Ingress replication via BGP is one less protocol to operate, and for fabrics up to a few dozen leaves the replication cost is negligible. Multicast earns its place only at scale, and by then you have the EVPN fundamentals solid.
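The replication cost is easy to put a number on: with ingress replication the sending VTEP makes one unicast copy of each broadcast, unknown-unicast, or multicast frame per remote VTEP in the VNI. A rough sketch of that count (the function name is mine, and a vPC pair sharing an anycast VTEP is assumed to count as one VTEP):

```python
def head_end_copies(vteps_in_vni: int) -> int:
    """Unicast copies the ingress VTEP sends for one BUM frame in that VNI."""
    if vteps_in_vni < 1:
        raise ValueError("a VNI needs at least one VTEP")
    return vteps_in_vni - 1  # one copy per remote VTEP, none to itself


for n in (4, 24, 200):
    print(n, head_end_copies(n))
```

At a couple of dozen VTEPs that is a couple of dozen copies of an ARP request; at a few hundred, the same broadcast costs the ingress leaf hundreds of packets, which is where multicast starts to pay for itself.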
Get the underlay boringly reliable, bring up EVPN, then add VNIs one tenant at a time with show nve peers open in another window. A fabric built this way fails predictably, which is the only kind of failure you want.