On a cloud provider, type: LoadBalancer provisions a real load balancer and an external IP appears. On bare metal, that same Service sits in <pending> forever, because nothing is listening for the request. MetalLB fills that gap. In BGP mode it does it the way a network engineer would want: by speaking BGP to your routers and advertising the Service IPs, so the routers ECMP traffic across nodes.
L2 mode exists and is simpler, but it funnels all traffic for a Service through one node. BGP mode is the one worth running.
Two Modes, One Right Answer for Scale
| | L2 mode | BGP mode |
|---|---|---|
| How it works | One node ARPs for the IP | Nodes announce IP via BGP |
| Traffic path | All through the leader node | ECMP across nodes |
| Failover | ARP re-announce (seconds) | BGP withdraw (sub-second when graceful or with BFD; otherwise hold time) |
| Needs | Nothing on the network | BGP-speaking routers |
| Scale | Limited by one node | Limited by your fabric |
L2 mode is fine for a homelab or a single small Service. BGP mode is what you run when throughput matters and you have routers that speak BGP — which, if you are reading this, you do.
Address Pools
MetalLB 0.13+ is configured with CRs, not a ConfigMap. First, the pool of IPs it is allowed to hand out:
```yaml
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: prod-pool
  namespace: metallb-system
spec:
  addresses:
    - 203.0.113.0/24
  autoAssign: true
```

These are the IPs Services will get. They must be routable to your fabric and not overlap with node or pod networks.
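The no-overlap rule is easy to check offline before applying the pool. The following is a minimal sketch using Python's standard `ipaddress` module; the node and pod CIDRs (`10.0.0.0/24`, `10.244.0.0/16`) are assumed for illustration and should be replaced with your cluster's real ranges.

```python
import ipaddress

def overlapping(pool_cidrs, reserved_cidrs):
    """Return (pool, reserved) CIDR pairs that share at least one address."""
    pools = [ipaddress.ip_network(c) for c in pool_cidrs]
    reserved = [ipaddress.ip_network(c) for c in reserved_cidrs]
    return [(str(p), str(r)) for p in pools for r in reserved if p.overlaps(r)]

# Assumed example values: substitute your real node and pod CIDRs.
pool = ["203.0.113.0/24"]
reserved = ["10.0.0.0/24", "10.244.0.0/16"]

conflicts = overlapping(pool, reserved)
print(conflicts or "no overlap")
```

An empty result means the pool is disjoint from the ranges you listed. It says nothing about whether the fabric can actually route the pool, which still has to be verified on the routers.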
BGP Peers
Tell MetalLB which routers to peer with. Typically both top-of-rack switches, so the announcement reaches the fabric from every node:
```yaml
apiVersion: metallb.io/v1beta1
kind: BGPPeer
metadata:
  name: tor1
  namespace: metallb-system
spec:
  myASN: 65100
  peerASN: 65000
  peerAddress: 10.0.0.1
  holdTime: 90s
---
apiVersion: metallb.io/v1beta1
kind: BGPPeer
metadata:
  name: tor2
  namespace: metallb-system
spec:
  myASN: 65100
  peerASN: 65000
  peerAddress: 10.0.0.2
```

Each node runs a BGP speaker that peers with the ToRs. When a Service gets an IP, every eligible node announces a /32 for it: all nodes under externalTrafficPolicy: Cluster, or only nodes hosting a ready endpoint under Local. The routers see the same /32 from multiple next-hops and, provided BGP multipath is enabled on them, install an ECMP route. The result is per-flow load balancing across nodes, done by the fabric, not by Kubernetes.
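To make "per-flow" concrete, here is a rough sketch of how an ECMP router picks a next-hop: it hashes the flow's 5-tuple and indexes into the next-hop set, so every packet of one connection lands on the same node. Real routers use vendor-specific hardware hashes, not SHA-256; the hash choice, the client address, and the three node IPs below are assumptions for illustration only.

```python
import hashlib

def pick_next_hop(flow, next_hops):
    """Deterministically map a flow 5-tuple to one next-hop (naive modulo ECMP)."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]

# Hypothetical node IPs announcing the Service /32.
next_hops = ["10.0.0.21", "10.0.0.22", "10.0.0.23"]

# (src IP, src port, dst IP, dst port, protocol)
flow = ("198.51.100.7", 51514, "203.0.113.5", 443, "tcp")

# The same flow always maps to the same node.
print(pick_next_hop(flow, next_hops) == pick_next_hop(flow, next_hops))
```

One consequence of modulo-style hashing is that when the next-hop set changes, many existing flows can be remapped to a different node. Whether that happens in practice depends on the router; some platforms offer resilient or consistent hashing to limit it.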
The Advertisement
The pool and peers are joined by a BGPAdvertisement, which controls how the pool’s IPs are announced:
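```yaml
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: prod-adv
  namespace: metallb-system
spec:
  ipAddressPools:
    - prod-pool
  aggregationLength: 32
  communities:
    - 65000:100
  localPref: 100   # LOCAL_PREF is only carried on iBGP sessions
```

aggregationLength: 32 (the default) advertises individual /32s. That is what you want for per-Service ECMP and clean failover, because a node losing its last endpoint withdraws exactly that /32. Communities let your routers apply policy to MetalLB routes, same as any other BGP source. LOCAL_PREF is an iBGP attribute and is not sent to eBGP peers, so with the eBGP peering above, have the ToRs match on the community rather than rely on localPref.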
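The effect of aggregationLength can be illustrated by masking a Service IP to the configured prefix length. This is a sketch of the arithmetic only, not MetalLB's code; `announced_prefix` is a hypothetical helper and the second Service IP is invented for the example.

```python
import ipaddress

def announced_prefix(service_ip: str, aggregation_length: int) -> str:
    """Prefix that covers service_ip at the given aggregation length."""
    network = ipaddress.ip_network(f"{service_ip}/{aggregation_length}", strict=False)
    return str(network)

print(announced_prefix("203.0.113.5", 32))   # 203.0.113.5/32
print(announced_prefix("203.0.113.5", 24))   # 203.0.113.0/24
print(announced_prefix("203.0.113.77", 24))  # 203.0.113.0/24
```

At length 24 both Services collapse into the same 203.0.113.0/24 route, so a router cannot tell which Service a given node can still serve. At length 32 each Service has its own route that can be withdrawn independently.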
externalTrafficPolicy: The Detail That Decides Behavior
This Service field interacts with BGP mode and matters more than anything else:
- Cluster (default): every node announces the Service IP, even nodes with no local pod. Traffic may hop node-to-node to reach a pod, which works but hides the client source IP behind SNAT and adds a hop.
- Local: only nodes with a running endpoint announce the IP. No extra hop, client source IP preserved, and the BGP withdraw when the last pod leaves is your health signal.
```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  selector: { app: web }
  ports: [{ port: 443, targetPort: 8443 }]
```

For most production services, Local is the right choice: real client IPs and traffic only going to nodes that can serve it. The trade is that the fabric balances per node, not per pod, so load follows where the pods are scheduled. Spread your pods across nodes, or one node gets all the traffic for that Service.
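The scheduling trade-off under Local is easiest to see with numbers. The sketch below assumes the router splits flows equally across announcing nodes and each node splits its share equally across its local pods; both are simplifications, and the node names are illustrative.

```python
from collections import Counter

def per_pod_share(pod_nodes):
    """Map each node to the fraction of total traffic that one pod on it receives.

    pod_nodes lists the node each pod runs on, one entry per pod.
    """
    pods_on = Counter(pod_nodes)
    node_share = 1 / len(pods_on)  # equal ECMP split across announcing nodes
    return {node: node_share / count for node, count in pods_on.items()}

# Three pods packed unevenly: two on node-21, one on node-22.
print(per_pod_share(["node-21", "node-21", "node-22"]))
# {'node-21': 0.25, 'node-22': 0.5}
```

The lone pod on node-22 carries twice the load of each pod on node-21, which is why even pod spreading (for example with topology spread constraints) matters more under Local than under Cluster.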
Verification
```bash
# Did the Service get an IP from the pool?
kubectl get svc web
# EXTERNAL-IP should be from 203.0.113.0/24, not <pending>

# Are the BGP sessions up?
# --tail=-1: with a label selector, kubectl logs shows only the last 10 lines per pod by default
kubectl logs -n metallb-system -l component=speaker --tail=-1 | grep -i "BGP session"

# On the router: is the /32 there via multiple next-hops (ECMP)?
show ip route 203.0.113.5
# should list multiple node IPs as next-hops with externalTrafficPolicy: Cluster,
# or only nodes running an endpoint with Local
```

The router-side check is the proof. If the Service IP shows a single next-hop when you expected several, either only one node has an endpoint (Local doing its job) or the speakers on other nodes are not peering. In the second case, check the speaker logs.
Where It Bites
- Pool not routable. The fabric must know to route the pool toward the cluster. The /32s from MetalLB handle the cluster side, but the rest of your network needs a path to that /24.
- ASN mismatch. peerASN/myASN must match what the router expects. iBGP vs eBGP changes next-hop behavior; usually you want the cluster in its own AS (eBGP to the ToRs).
- FRR mode. Newer MetalLB can run an FRR-based BGP backend (frr-k8s), which supports things the native speaker does not (BFD, receiving routes, richer policy). If you need BFD for sub-second failover, that is the backend to choose.
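The ASN pitfall can be checked mechanically before a peer is applied. This is a small sketch with hypothetical helper names: it classifies a session as iBGP or eBGP and tests whether an ASN falls in the private-use ranges reserved by RFC 6996.

```python
# Private-use ASN ranges from RFC 6996.
PRIVATE_16BIT = range(64512, 65535)            # 64512..65534
PRIVATE_32BIT = range(4200000000, 4294967295)  # 4200000000..4294967294

def session_type(my_asn: int, peer_asn: int) -> str:
    """iBGP when both sides share an AS number, eBGP otherwise."""
    return "iBGP" if my_asn == peer_asn else "eBGP"

def is_private_asn(asn: int) -> bool:
    """True if the ASN is reserved for private use."""
    return asn in PRIVATE_16BIT or asn in PRIVATE_32BIT

# Values from the BGPPeer manifests in this article.
print(session_type(65100, 65000))                    # eBGP
print(is_private_asn(65100), is_private_asn(65000))  # True True
```

Both ASNs used here are private, which is fine inside your own fabric. They should be stripped or replaced before any of these routes are advertised to the public Internet.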
When a Node Dies: The Failover Drill
The promise of BGP mode is sub-second failover. Verify it before you depend on it. Pick a Service, find which nodes announce its /32, then kill one and watch the route converge.
```bash
# Which nodes currently announce 203.0.113.5?
kubectl get svc web -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
kubectl get endpointslices -l kubernetes.io/service-name=web \
  -o jsonpath='{range .items[*].endpoints[*]}{.nodeName}{"\n"}{end}'
```

On the router, confirm the ECMP set, then drain a node and watch a next-hop vanish:
```bash
# Router, before
show ip route 203.0.113.5
# 203.0.113.5/32 via 10.0.0.21, via 10.0.0.22, via 10.0.0.23

# Cordon + drain one node to evict the endpoint
kubectl drain node-21 --ignore-daemonsets --delete-emptydir-data

# Router, after: a drain produces a graceful withdraw (with externalTrafficPolicy: Local),
# so that next-hop should disappear almost immediately, BFD or not
show ip route 203.0.113.5
# 203.0.113.5/32 via 10.0.0.22, via 10.0.0.23
```

A hard failure is different. A powered-off node does not send a BGP withdraw, so without BFD the router keeps the route until holdTime (90s above) expires. That is the gap BFD closes. With the native speaker there is no BFD, so a hard node failure blackholes that Service’s share of flows until the hold time expires. If sub-second matters on hardware failure (not just graceful drain), run the frr-k8s backend and enable BFD on the peer.
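A back-of-the-envelope calculation shows what a hard failure costs with and without BFD. The sketch assumes an equal ECMP split across three announcing nodes and treats the detection time as a worst case; `blackhole_exposure` is a hypothetical helper, not part of any tool.

```python
def blackhole_exposure(announcing_nodes: int, detection_seconds: float):
    """Return (fraction of flows hashed to the dead node, seconds until the route is pulled).

    Assumes the router splits flows equally across announcing nodes.
    """
    return 1 / announcing_nodes, detection_seconds

# Worst-case detection: BGP hold time of 90s versus BFD at 300ms x 3.
for label, detection in [("hold time only", 90.0), ("BFD 300ms x 3", 0.9)]:
    fraction, seconds = blackhole_exposure(3, detection)
    print(f"{label}: {fraction:.0%} of flows dropped for up to {seconds}s")
```

The affected fraction is the same in both cases; only the duration changes, from about a minute and a half to under a second.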
BFD on the FRR Backend
frr-k8s exposes BFD through the same CRD surface. Define a BFDProfile, then reference it from the peer:
```yaml
apiVersion: metallb.io/v1beta1
kind: BFDProfile
metadata:
  name: tor-bfd
  namespace: metallb-system
spec:
  receiveInterval: 300
  transmitInterval: 300
  detectMultiplier: 3
---
apiVersion: metallb.io/v1beta1
kind: BGPPeer
metadata:
  name: tor1
  namespace: metallb-system
spec:
  myASN: 65100
  peerASN: 65000
  peerAddress: 10.0.0.1
  bfdProfile: tor-bfd
```

300ms × 3 detects a dead peer in under a second. The router needs a matching BFD config on its side: the session is bidirectional and will not come up half-configured. Verify from inside the speaker pod, which runs FRR:
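The "300ms × 3" figure only holds when both sides agree on the timers. Per RFC 5880, the local detection time is the remote side's detect multiplier times the negotiated interval, where the negotiated interval is the larger of the local receive interval and the remote transmit interval. A minimal sketch of that arithmetic, with a hypothetical function name:

```python
def bfd_detection_ms(local_rx_ms: int, remote_tx_ms: int, remote_multiplier: int) -> int:
    """Local BFD detection time in milliseconds (asynchronous mode, RFC 5880)."""
    negotiated_interval = max(local_rx_ms, remote_tx_ms)
    return negotiated_interval * remote_multiplier

# Both sides at 300ms with multiplier 3, as in the profile above.
print(bfd_detection_ms(300, 300, 3))   # 900

# Router left at a slower 1000ms transmit interval: detection stretches to 3s.
print(bfd_detection_ms(300, 1000, 3))  # 3000
```

If the router keeps slower defaults, the session still comes up but failure detection is slower than the profile suggests, so check the negotiated timers on both ends.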
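```bash
kubectl exec -n metallb-system -it ds/speaker -c frr -- \
  vtysh -c "show bfd peers brief"
# SessionId  LocalAddress  PeerAddress  Status
# ...        10.0.0.21     10.0.0.1     up
```

If the session stays down while BGP itself is Established, it is almost always a one-sided BFD config or an ACL dropping UDP/3784 between the node and the ToR. The command above assumes the frr container runs inside the speaker pod. If FRR runs in separate frr-k8s pods in your deployment, exec into those instead.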
MetalLB in BGP mode turns type: LoadBalancer on bare metal into exactly what it is in the cloud — an external IP that just works — but with the bonus that the load balancing is done by your fabric’s ECMP, transparently, using the protocol your routers already speak.