MetalLB and BGP: LoadBalancer Services on Bare Metal

On a cloud provider, type: LoadBalancer provisions a real load balancer and an external IP appears. On bare metal, that same Service sits in <pending> forever, because nothing is listening for the request. MetalLB fills that gap. In BGP mode it does it the way a network engineer would want: by speaking BGP to your routers and advertising the Service IPs, so the routers ECMP traffic across nodes.

L2 mode exists and is simpler, but it funnels all traffic for a Service through one node. BGP mode is the one worth running.

Two Modes, One Right Answer for Scale

               | L2 mode                      | BGP mode
---------------+------------------------------+---------------------------
How it works   | One node ARPs for the IP     | Nodes announce IP via BGP
Traffic path   | All through the leader node  | ECMP across nodes
Failover       | ARP re-announce (seconds)    | BGP withdraw (sub-second)
Needs          | Nothing on the network       | BGP-speaking routers
Scale          | Limited by one node          | Limited by your fabric

L2 mode is fine for a homelab or a single small Service. BGP mode is what you run when throughput matters and you have routers that speak BGP — which, if you are reading this, you do.

Address Pools

MetalLB 0.13+ is configured with CRs, not a ConfigMap. First, the pool of IPs it is allowed to hand out:

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: prod-pool
  namespace: metallb-system
spec:
  addresses:
    - 203.0.113.0/24
  autoAssign: true

These are the IPs Services will get. They must be routable within your fabric and must not overlap with node or pod networks.
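
The overlap rule is easy to check mechanically before applying the pool. The sketch below uses Python's standard `ipaddress` module; the node, pod, and service CIDRs are hypothetical placeholders, not values from this cluster, so substitute your own.

```python
import ipaddress


def overlapping(pool: str, others: dict[str, str]) -> list[str]:
    """Return the names of the networks in `others` that overlap `pool`."""
    pool_net = ipaddress.ip_network(pool)
    return [
        name
        for name, cidr in others.items()
        if pool_net.overlaps(ipaddress.ip_network(cidr))
    ]


# Hypothetical cluster networks; replace with the real ones.
cluster_networks = {
    "nodes": "10.0.0.0/24",
    "pods": "10.244.0.0/16",
    "services": "10.96.0.0/12",
}

print(overlapping("203.0.113.0/24", cluster_networks))  # []
print(overlapping("10.244.8.0/24", cluster_networks))   # ['pods']
```

An empty list means the pool is safe to hand to MetalLB as far as addressing goes; it says nothing about whether the fabric actually routes it.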

BGP Peers

Tell MetalLB which routers to peer with. Typically that is both top-of-rack switches, so every node keeps a session into the fabric if one switch fails:

apiVersion: metallb.io/v1beta1
kind: BGPPeer
metadata:
  name: tor1
  namespace: metallb-system
spec:
  myASN: 65100
  peerASN: 65000
  peerAddress: 10.0.0.1
  holdTime: 90s
---
apiVersion: metallb.io/v1beta1
kind: BGPPeer
metadata:
  name: tor2
  namespace: metallb-system
spec:
  myASN: 65100
  peerASN: 65000
  peerAddress: 10.0.0.2

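Each node runs a BGP speaker that peers with the ToRs. When a Service gets an IP, the speakers announce a /32 for it: from every node by default, or only from nodes hosting a ready endpoint when the Service sets externalTrafficPolicy: Local. The routers see the same /32 from multiple next-hops and install an ECMP route, provided BGP multipath is enabled on them (many platforms install only a single best path by default). The result is per-flow load balancing across nodes, done by the fabric, not by Kubernetes.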
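
To make "per-flow" concrete, here is a minimal sketch of how an ECMP route picks a next-hop. It is an illustration only: real routers use vendor-specific hardware hashes, and the SHA-256 hash, the modulo selection, and the example addresses are all assumptions made for the demo.

```python
import hashlib


def pick_next_hop(flow: tuple, next_hops: list[str]) -> str:
    """Map a flow 5-tuple to one next-hop by hashing it (illustrative hash)."""
    digest = hashlib.sha256(repr(flow).encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]


# Hypothetical node addresses acting as next-hops for 203.0.113.5/32.
next_hops = ["10.0.0.21", "10.0.0.22", "10.0.0.23"]
flow = ("198.51.100.7", 40312, "203.0.113.5", 443, "tcp")

# Every packet of one flow hashes to the same next-hop, so a TCP
# connection stays on one node for its whole lifetime.
assert pick_next_hop(flow, next_hops) == pick_next_hop(flow, next_hops)

# Many flows (here: different source ports) spread across the set.
counts = {hop: 0 for hop in next_hops}
for src_port in range(40000, 43000):
    f = ("198.51.100.7", src_port, "203.0.113.5", 443, "tcp")
    counts[pick_next_hop(f, next_hops)] += 1
print(counts)
```

With plain modulo selection like this, removing a next-hop can move flows that were not on the failed node. Whether your routers behave that way depends on whether they implement resilient (consistent) hashing.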

The Advertisement

The pool and peers are joined by a BGPAdvertisement, which controls how the pool’s IPs are announced:

apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: prod-adv
  namespace: metallb-system
spec:
  ipAddressPools:
    - prod-pool
  aggregationLength: 32
  communities:
    - 65000:100
  localPref: 100

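aggregationLength: 32 (the default) advertises individual /32s. That gives per-Service ECMP and clean failover, because a node losing its last endpoint withdraws exactly that /32. Communities let your routers apply policy to MetalLB routes, same as any other BGP source. Note that LOCAL_PREF is only carried on iBGP sessions, so with the eBGP peering shown here (AS 65100 to AS 65000) the localPref field should have no effect on the ToRs; set preference on the router side, for example by matching the community.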
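
As a rough illustration of what aggregationLength changes, the sketch below computes the prefix that would cover a given Service IP at a given length. It only models the prefix arithmetic with Python's `ipaddress` module; it is not MetalLB's code, and the function name is made up for the example.

```python
import ipaddress


def advertised_prefix(service_ip: str, aggregation_length: int) -> str:
    """Prefix of the given length that contains service_ip."""
    network = ipaddress.ip_network(
        f"{service_ip}/{aggregation_length}", strict=False
    )
    return str(network)


print(advertised_prefix("203.0.113.5", 32))  # 203.0.113.5/32
print(advertised_prefix("203.0.113.5", 24))  # 203.0.113.0/24
```

At length 24 every Service in the pool collapses into the same route, so the router can no longer tell which nodes serve which Service IP. That is the precision the /32 keeps.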

externalTrafficPolicy: The Detail That Decides Behavior

This Service field interacts with BGP mode and matters more than anything else:

  • Cluster (default): every node announces the Service IP, even nodes with no local pod. Traffic may hop node-to-node to reach a pod, which works but hides the client source IP behind SNAT and adds a hop.
  • Local: only nodes with a running endpoint announce the IP. No extra hop, client source IP preserved, and the BGP withdraw when the last local pod goes away is your health signal.

apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  selector: { app: web }
  ports: [{ port: 443, targetPort: 8443 }]

For most production services, Local is the right choice: real client IPs and traffic only going to nodes that can serve it. The trade is that load balancing is now weighted by where the pods are scheduled — spread your pods across nodes or one node gets all the traffic for that Service.
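
The weighting effect is simple arithmetic. The sketch below assumes the router splits traffic equally across announcing nodes and that each node splits its share evenly among its local pods; both are simplifying assumptions, and the node names and pod counts are hypothetical.

```python
def per_pod_share(pods_per_node: dict[str, int]) -> dict[str, float]:
    """Fraction of Service traffic each pod on a node receives under Local.

    Nodes with zero pods do not announce the IP and get no traffic.
    """
    announcing = {node: n for node, n in pods_per_node.items() if n > 0}
    if not announcing:
        return {}
    node_share = 1 / len(announcing)  # equal-cost split across nodes
    return {node: round(node_share / n, 3) for node, n in announcing.items()}


# Three pods packed on node-21, one pod on node-22, none on node-23.
print(per_pod_share({"node-21": 3, "node-22": 1, "node-23": 0}))
# {'node-21': 0.167, 'node-22': 0.5}
```

The lone pod on node-22 carries half the traffic while each pod on node-21 carries a sixth. Topology spread constraints or pod anti-affinity are the usual way to keep those numbers even.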

Verification

# Did the Service get an IP from the pool?
kubectl get svc web
# EXTERNAL-IP should be from 203.0.113.0/24, not <pending>
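# Are the BGP sessions up? (--tail=-1 because kubectl shows only the last 10 lines per pod when a label selector is used)
kubectl logs -n metallb-system -l component=speaker --tail=-1 | grep -i "BGP session"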
# On the router: is the /32 there via multiple next-hops (ECMP)?
show ip route 203.0.113.5
# should list multiple node IPs as next-hops with externalTrafficPolicy: Cluster,
# or only nodes running an endpoint with Local

The router-side check is the proof. If the Service IP shows a single next-hop when you expected several, either only one node has an endpoint (Local doing its job) or the speakers on other nodes are not peering — check the speaker logs.
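
The first check can also be scripted. This is a minimal sketch that classifies a Service object as pending, inside the pool, or outside it; the function name is hypothetical and the sample object is hand-written and trimmed to the one field the check reads (`status.loadBalancer.ingress`).

```python
import ipaddress


def check_service(service: dict, pool: str) -> str:
    """Classify a Service object as 'pending', 'ok', or 'outside-pool'."""
    pool_net = ipaddress.ip_network(pool)
    ingress = service.get("status", {}).get("loadBalancer", {}).get("ingress") or []
    ips = [entry["ip"] for entry in ingress if "ip" in entry]
    if not ips:
        return "pending"
    if all(ipaddress.ip_address(ip) in pool_net for ip in ips):
        return "ok"
    return "outside-pool"


# Hand-written sample, not real cluster output.
assigned = {"status": {"loadBalancer": {"ingress": [{"ip": "203.0.113.5"}]}}}
unassigned = {"status": {"loadBalancer": {}}}

print(check_service(assigned, "203.0.113.0/24"))    # ok
print(check_service(unassigned, "203.0.113.0/24"))  # pending
```

To run it against a live cluster, one option is to load the output of kubectl get svc web -o json with json.load and pass the resulting dict in.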

Where It Bites

  • Pool not routable. The fabric must know to route the pool toward the cluster. The /32s from MetalLB handle the cluster side, but the rest of your network needs a path to that /24.
  • ASN mismatch. peerASN/myASN must match what the router expects. iBGP vs eBGP changes next-hop behavior — usually you want the cluster in its own AS (eBGP to the ToRs).
  • FRR mode. Newer MetalLB can run an FRR-based BGP backend (FRR inside the speaker pod, or the separate frr-k8s daemon), which supports things the native speaker does not (BFD, receiving routes, richer policy). If you need BFD for sub-second failover, that is the backend to choose.

When a Node Dies: The Failover Drill

The promise of BGP mode is sub-second failover. Verify it before you depend on it. Pick a Service, find which nodes announce its /32, then take one out of service and watch the route converge.

# Which nodes currently announce 203.0.113.5?
kubectl get svc web -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
kubectl get endpointslices -l kubernetes.io/service-name=web \
  -o jsonpath='{range .items[*].endpoints[*]}{.nodeName}{"\n"}{end}'
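
The jsonpath above prints one line per endpoint, including duplicates and endpoints that are not ready. As a sketch of the post-processing, the function below reduces an EndpointSlice list (the JSON from the same query with -o json) to the set of nodes with at least one ready endpoint, which under externalTrafficPolicy: Local is the next-hop set you would expect on the router. The function name is hypothetical and the sample data is hand-written, not captured from a cluster.

```python
def announcing_nodes(endpoint_slices: dict) -> list[str]:
    """Nodes that host at least one ready endpoint, sorted and de-duplicated."""
    nodes = set()
    for item in endpoint_slices.get("items", []):
        for endpoint in item.get("endpoints") or []:
            ready = (endpoint.get("conditions") or {}).get("ready")
            # The EndpointSlice API says an unset `ready` should usually be
            # treated as ready, so only an explicit False excludes a node.
            if ready is not False and endpoint.get("nodeName"):
                nodes.add(endpoint["nodeName"])
    return sorted(nodes)


# Hand-written sample, trimmed to the fields the function reads.
sample = {
    "items": [
        {
            "endpoints": [
                {"nodeName": "node-21", "conditions": {"ready": True}},
                {"nodeName": "node-22", "conditions": {"ready": True}},
                {"nodeName": "node-22", "conditions": {"ready": True}},
                {"nodeName": "node-23", "conditions": {"ready": False}},
            ]
        }
    ]
}

print(announcing_nodes(sample))  # ['node-21', 'node-22']
```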

On the router, confirm the ECMP set, then drain a node and watch a next-hop vanish:

# Router, before
show ip route 203.0.113.5
# 203.0.113.5/32 via 10.0.0.21, via 10.0.0.22, via 10.0.0.23
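# Cordon + drain one node to evict the endpoint (with externalTrafficPolicy: Local,
# the speaker on that node then withdraws the /32 itself)
kubectl drain node-21 --ignore-daemonsets --delete-emptydir-data
# Router, after: this is a graceful BGP withdraw, so the next-hop should disappear
# almost as soon as the endpoint is gone, with or without BFD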
show ip route 203.0.113.5
# 203.0.113.5/32 via 10.0.0.22, via 10.0.0.23

A drain is the easy case, because the speaker withdraws the route itself. A node that hard-fails (powered off, kernel panic) sends no withdraw, so the router keeps the dead next-hop until the BGP hold timer expires: up to holdTime, 90s above. The native speaker has no BFD, so during that window the failed node's share of the Service's flows is blackholed. BFD closes that gap. If sub-second failover matters on hardware failure and not just on graceful drain, run an FRR-based backend and enable BFD on the peer.
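
The size of that gap is worth putting numbers on. The sketch below is back-of-the-envelope arithmetic, not a measurement: it assumes keepalives are sent every holdTime/3, a symmetric BFD session where detection time is interval × multiplier, and equal ECMP weighting across next-hops. The 300 ms × 3 figures are example BFD values.

```python
from typing import Optional


def hold_timer_window_s(
    hold_time_s: float, keepalive_s: Optional[float] = None
) -> tuple[float, float]:
    """(best, worst) seconds from a silent node failure to hold-timer expiry.

    Best case: the node dies just before its next keepalive was due.
    Worst case: it dies right after sending one.
    """
    if keepalive_s is None:
        keepalive_s = hold_time_s / 3  # assumed keepalive interval
    return (float(hold_time_s - keepalive_s), float(hold_time_s))


def bfd_detection_ms(interval_ms: int, detect_multiplier: int) -> int:
    """Detection time for a symmetric BFD session (simplified)."""
    return interval_ms * detect_multiplier


def blackholed_share(next_hop_count: int) -> float:
    """Fraction of flows hashed onto the one failed next-hop."""
    return 1 / next_hop_count


print(hold_timer_window_s(90))   # (60.0, 90.0)
print(bfd_detection_ms(300, 3))  # 900
print(round(blackholed_share(3), 2))  # 0.33
```

With three next-hops and no BFD, roughly a third of the Service's flows fail for 60 to 90 seconds after a hard node failure. With those BFD values the same failure is detected in about 0.9 seconds.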

BFD on the FRR Backend

The FRR-based backend exposes BFD through the same metallb.io CRDs. Define a BFDProfile, then reference it from the peer:

apiVersion: metallb.io/v1beta1
kind: BFDProfile
metadata:
  name: tor-bfd
  namespace: metallb-system
spec:
  receiveInterval: 300
  transmitInterval: 300
  detectMultiplier: 3
---
apiVersion: metallb.io/v1beta1
kind: BGPPeer
metadata:
  name: tor1
  namespace: metallb-system
spec:
  myASN: 65100
  peerASN: 65000
  peerAddress: 10.0.0.1
  bfdProfile: tor-bfd

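300 ms × 3 = 900 ms, so a dead peer is detected in under a second. The router needs a matching BFD config on its side: the session is bidirectional and will not come up half-configured. Verify from the FRR container. The command below assumes FRR runs inside the speaker pod; if your deployment runs frr-k8s as its own DaemonSet, point the exec at that instead: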

kubectl exec -n metallb-system -it ds/speaker -c frr -- \
  vtysh -c "show bfd peers brief"
# SessionId LocalAddress PeerAddress Status
# ... 10.0.0.21 10.0.0.1 up

If the session stays down while BGP itself is Established, it is almost always a one-sided BFD config or an ACL dropping UDP/3784 between the node and the ToR.

MetalLB in BGP mode turns type: LoadBalancer on bare metal into exactly what it is in the cloud — an external IP that just works — but with the bonus that the load balancing is done by your fabric’s ECMP, transparently, using the protocol your routers already speak.