CI/CD for Network Configs: Test and Rollback with GitLab

The way most network changes still happen: open a ticket, SSH into the box at a maintenance window, paste config, hope. There is no review, no test, and “rollback” means frantically retyping from memory while the link is down. We solved this for application code a decade ago with CI/CD. The same pipeline works for network config — and the payoff is even bigger, because a bad route affects everyone, not one service.

The model: config lives in git, changes go through merge requests, a pipeline validates them offline, and deployment is a gated, diffed, reversible step.

The Pipeline Shape

commit -> MR
  render     generate intended config from source-of-truth data
  validate   Batfish: offline correctness checks, no device touched
  diff       NAPALM dry-run: show what would change on each device
    |        (manual approval gate)
  deploy     capture pre-change backup, then commit the change
  verify     reachability / protocol state; rollback on failure

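The crucial idea is that nothing before deploy changes a device. Render and validate run entirely offline, and the diff stage only loads a candidate, compares it against the running config, and discards it. You catch the broken ACL, the typo’d next-hop, and the accidentally withdrawn prefix in CI, not on the production router.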

Offline Validation with Batfish

Batfish builds a model of your network from the config text and answers questions about it without ever connecting to a device. It catches the failures that a syntax check misses — unreachable ACL lines, BGP sessions that will not come up, routes that will not resolve.

from pybatfish.client.session import Session
from pybatfish.datamodel.flow import HeaderConstraints, PathConstraints

bf = Session(host="batfish")
bf.init_snapshot("./configs", name="candidate", overwrite=True)

# Will any BGP session fail to establish?
bf.q.bgpSessionStatus().answer().frame()

# Are there ACL lines that can never match?
bf.q.filterLineReachability().answer().frame()

# Does a critical flow still get through after this change?
bf.q.reachability(
    pathConstraints=PathConstraints(startLocation="leaf1"),
    headers=HeaderConstraints(dstIps="10.20.0.0/24", dstPorts="443"),
    actions="SUCCESS",
).answer().frame()

A reachability test that flips from SUCCESS to FAIL between the current and candidate snapshots is a change that breaks production traffic — and you found it in a pipeline, before touching anything.
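The queries above only return data frames; something still has to turn them into a pass or fail. Here is a minimal sketch of that gate logic. The function names are hypothetical (the real `batfish_checks.py` is not shown in this article). The rows are plain dicts, as produced by `frame().to_dict("records")`. The column names `Established_Status`, `Node`, `Remote_IP`, and `Unreachable_Line` are what I would expect from current pybatfish output, so check them against your version.

```python
import sys


def collect_failures(bgp_rows, acl_rows, flow_rows):
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    # Any session Batfish does not model as ESTABLISHED is a failure.
    for row in bgp_rows:
        status = row.get("Established_Status")
        if status != "ESTABLISHED":
            failures.append(
                f"BGP session {row.get('Node')} -> {row.get('Remote_IP')} is {status}"
            )
    # Every row of filterLineReachability describes a line that can never match.
    for row in acl_rows:
        failures.append(f"unreachable ACL line: {row.get('Unreachable_Line')}")
    # reachability(actions="SUCCESS") returns example flows that get through;
    # no rows means no flow matching the constraints is delivered.
    if not flow_rows:
        failures.append("critical flow has no SUCCESS path")
    return failures


def gate(bgp_rows, acl_rows, flow_rows):
    """Print failures to stderr and return a process exit code."""
    failures = collect_failures(bgp_rows, acl_rows, flow_rows)
    for message in failures:
        print(f"FAIL: {message}", file=sys.stderr)
    return 1 if failures else 0
```

In a CI script you would end with `sys.exit(gate(...))`, so a single failed check fails the job.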

The GitLab Pipeline

stages: [render, validate, diff, deploy, verify]

render:
  stage: render
  script:
    - python render_from_netbox.py --out configs/
  artifacts:
    paths: [configs/]

validate:
  stage: validate
  script:
    - python batfish_checks.py configs/   # fails job on any check regression

diff:
  stage: diff
  script:
    - python napalm_diff.py configs/      # dry-run, prints per-device diff
  artifacts:
    paths: [diffs/]

deploy:
  stage: deploy
  when: manual                            # human approval gate
  allow_failure: false                    # manual jobs are optional by default; make this one blocking
  environment: production
  script:
    - python napalm_deploy.py configs/ --backup-dir backups/
  artifacts:
    paths: [backups/]                     # verify is a separate job and needs these to roll back

verify:
  stage: verify
  needs: [deploy]
  script:
    # roll back on failure, then still fail the job so the pipeline goes red
    - python verify_state.py || { python rollback.py backups/; exit 1; }

when: manual on deploy is the gate. The diffs are an artifact on the merge request, so a reviewer approves the actual change, not a description of it. Nothing is committed to a device until a human clicks deploy with the diff in front of them.
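The diff stage is worth sketching because it is the one pre-deploy step that does open a session to the device. A minimal version, assuming a NAPALM driver object is passed in: it loads the candidate, reads the diff, and always discards it, so nothing is committed. The function name `dry_run_diff` is hypothetical; `napalm_diff.py` itself is not shown in this article.

```python
def dry_run_diff(device, candidate_path):
    """Return the diff a full replace would produce, without committing it.

    `device` is expected to be a NAPALM driver instance, for example
    get_network_driver("ios")(host, user, password).
    """
    device.open()
    try:
        device.load_replace_candidate(filename=str(candidate_path))
        try:
            return device.compare_config()
        finally:
            # Always drop the candidate so the device is left untouched.
            device.discard_config()
    finally:
        device.close()
```

Writing each returned string to `diffs/<host>.diff` gives the reviewer one file per device. An empty string means the device already matches the rendered config.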

Deploy and Backup Together

The deploy step’s first job is to capture the running config so rollback is real, not theoretical:

from napalm import get_network_driver

# USER and PW are assumed to be defined elsewhere (for example, read from CI variables).

def deploy(host, driver, candidate, backup_dir):
    d = get_network_driver(driver)(host, USER, PW)
    d.open()
    try:
        # backup BEFORE changing anything
        running = d.get_config(retrieve="running")["running"]
        with open(f"{backup_dir}/{host}.cfg", "w") as f:
            f.write(running)
        # load, diff once more, commit
        d.load_replace_candidate(config=candidate)
        diff = d.compare_config()
        if diff:
            d.commit_config()
        else:
            d.discard_config()  # don't leave a candidate locked on the device
    finally:
        d.close()  # release the session even if a step above raises

load_replace_candidate (full replace) plus a captured backup means rollback is “load the backup and commit” — deterministic, not a human retyping under pressure.

Verify, and Roll Back on Failure

Deployment is not done when the commit succeeds. It is done when the network still works. The verify stage proves it:

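# verify_state.py: exit non-zero to trigger rollback
import sys

# bgp_neighbors_established, interface_up and ping are helpers assumed to be
# defined elsewhere; each returns True when its check passes.
checks = [
    bgp_neighbors_established(expected=12),
    interface_up("Ethernet1/1"),
    ping("10.20.0.1", count=5, loss_max=0),
]
sys.exit(0 if all(checks) else 1)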

If verify exits non-zero, the pipeline runs rollback.py, which reloads the backups captured in the deploy step. The window between “bad change committed” and “previous config restored” is the length of one pipeline stage, automatically, with no one paging the on-call to remember what the config used to be.
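One practical detail: BGP sessions take time to re-establish after a commit, so a check that runs once, immediately, can fail for the wrong reason and trigger a needless rollback. Here is a minimal sketch of a polling helper plus a peer counter. Both names are hypothetical, and the timeout and interval are placeholder values. The counter assumes the dict layout that NAPALM's `get_bgp_neighbors()` returns (VRF name, then `peers`, then per-peer `is_up`).

```python
import time


def count_established(bgp_neighbors):
    """Count peers reported up in a get_bgp_neighbors()-style dict."""
    return sum(
        1
        for vrf in bgp_neighbors.values()
        for peer in vrf.get("peers", {}).values()
        if peer.get("is_up")
    )


def wait_for(check, timeout_s=120, interval_s=5, sleep=time.sleep, clock=time.monotonic):
    """Poll check() until it returns True or timeout_s elapses.

    Returns True on success, False on timeout. `sleep` and `clock` are
    injectable so the helper can be tested without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A helper like `bgp_neighbors_established(expected=12)` could then be `wait_for(lambda: count_established(d.get_bgp_neighbors()) >= 12)`. Keep the timeout well below the confirmed-commit window if you use one.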

Junos Makes Part of This Free

On platforms with native candidate config and confirmed commit, lean on it as a second safety net:

commit confirmed 5

If the pipeline’s verify stage does not issue a final commit within five minutes, the device rolls back on its own. Even if your CI runner dies mid-deploy, the router un-breaks itself. Use both — pipeline rollback for the general case, commit confirmed for the “the automation itself failed” case.
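The same pattern can be driven from the pipeline rather than typed at the CLI. This is a sketch under assumptions: recent NAPALM releases expose `commit_config(revert_in=...)`, `confirm_commit()`, and `rollback()`, but support varies by driver and version, so confirm it for your platform before relying on it. The function name is hypothetical, and the candidate is assumed to be loaded already.

```python
def commit_then_confirm(device, verify, revert_in=300):
    """Commit with a revert timer and confirm only if verify() passes.

    `device` is a NAPALM driver with a candidate already loaded.
    `revert_in` is in seconds; 300 matches `commit confirmed 5` on Junos.
    """
    # If nothing confirms within revert_in seconds, the device reverts itself.
    device.commit_config(revert_in=revert_in)
    if verify():
        device.confirm_commit()
        return True
    # We already know the change is bad, so don't wait for the timer.
    device.rollback()
    return False
```

If the runner dies between the commit and the confirm, no code runs at all, which is exactly the case the device-side timer covers.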

Comparing Snapshots, Not Just Checking One

A single Batfish snapshot answers “is this config valid?” The more useful question is “did this change break something that worked before?” Batfish models a reference snapshot (current production) and a candidate snapshot (the MR) side by side, and the diff of their answers is the signal:

from pybatfish.client.session import Session
from pybatfish.datamodel.flow import HeaderConstraints

bf = Session(host="batfish")
bf.init_snapshot("./configs-current", name="reference", overwrite=True)
bf.init_snapshot("./configs-candidate", name="candidate", overwrite=True)

# Sessions that are up in reference but down in candidate = a regression this MR introduces
ref = bf.q.bgpSessionStatus().answer(snapshot="reference").frame()
cand = bf.q.bgpSessionStatus().answer(snapshot="candidate").frame()

# Flows whose fate differs between the two snapshots (delivered now, dropped after, or the reverse)
bf.q.differentialReachability(
    headers=HeaderConstraints(dstIps="10.20.0.0/24"),
).answer(snapshot="candidate", reference_snapshot="reference").frame()

differentialReachability is the one to wire into the gate: it returns exactly the flows whose fate changed between the two snapshots. An empty result means this MR does not alter who can reach what. A non-empty result is the list of conversations the reviewer must consciously sign off on. That reframes review from reading config lines to approving behavioral deltas.
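The two `bgpSessionStatus` frames need the same treatment: fetching `ref` and `cand` is only useful once something compares them. Here is a minimal sketch of that comparison, again on plain row dicts from `frame().to_dict("records")`. The function names are hypothetical, and the column names (`Node`, `VRF`, `Remote_IP`, `Established_Status`) are assumptions to verify against your pybatfish version.

```python
def session_key(row):
    """Identify a session by local node, VRF, and remote address."""
    return (str(row["Node"]), str(row["VRF"]), str(row["Remote_IP"]))


def bgp_regressions(ref_rows, cand_rows):
    """Sessions ESTABLISHED in the reference but not in the candidate.

    Returns (session_key, candidate_status) pairs; a session that is
    missing from the candidate entirely is reported as "MISSING".
    """
    cand_status = {session_key(r): r["Established_Status"] for r in cand_rows}
    regressions = []
    for row in ref_rows:
        if row["Established_Status"] != "ESTABLISHED":
            continue  # it was already down before this MR
        after = cand_status.get(session_key(row), "MISSING")
        if after != "ESTABLISHED":
            regressions.append((session_key(row), after))
    return regressions
```

Sessions that were already down in the reference are ignored on purpose: the gate reports what this MR breaks, not pre-existing problems.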

Failure Drill: The Verify Stage Caught a Bad Commit

Walk the unhappy path end to end, because the unhappy path is the whole reason to build this. An MR retags a transit interface into the wrong VRF. Batfish passes — the config is syntactically valid and the VRF exists. NAPALM’s diff looks small and plausible. A reviewer clicks deploy.

deploy: config committed on leaf3, backup saved to backups/leaf3.cfg
verify: bgp_neighbors_established(expected=12) -> got 10 FAIL

Two sessions across that interface dropped because the neighbor is now in a different VRF. verify_state.py exits non-zero, the pipeline runs rollback.py, and the captured backup reloads:

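# rollback.py: restore every backup captured by the deploy step
import glob, os, sys
from napalm import get_network_driver

BACKUP_DIR = sys.argv[1]  # the pipeline passes backups/
# USER and PW are assumed to be defined elsewhere, as in deploy()
for path in glob.glob(os.path.join(BACKUP_DIR, "*.cfg")):
    host = os.path.basename(path).removesuffix(".cfg")  # needs Python 3.9+
    d = get_network_driver("ios")(host, USER, PW)
    d.open()
    try:
        d.load_replace_candidate(filename=path)  # full replace back to pre-change
        if d.compare_config():
            d.commit_config()
        else:
            d.discard_config()  # already matches the backup; release the candidate
    finally:
        d.close()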

Total exposure is one verify stage, including the rollback it triggers (call it two minutes), and no human had to remember what leaf3 looked like before. Contrast the SSH-and-paste version: the same mistake is discovered when a service team opens a ticket an hour later, and the fix is an engineer reconstructing the old config from memory.

One real gotcha: the verify stage must connect through a path the change cannot sever. If your runner reaches leaf3 only via the link the MR just broke, verify times out for the wrong reason and rollback may not reach the device either. Verify and roll back over out-of-band management, never the production data path.
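A cheap pre-flight check can enforce this before deploy runs. Here is a minimal sketch, assuming a hypothetical inventory that maps each hostname to the address the runner connects to, and a hypothetical out-of-band prefix; neither appears in the pipeline above.

```python
import ipaddress


def hosts_outside_oob(inventory, oob_prefix):
    """Return hostnames whose connection address is not in the OOB prefix."""
    oob = ipaddress.ip_network(oob_prefix)
    return sorted(
        host
        for host, address in inventory.items()
        if ipaddress.ip_address(address) not in oob
    )


# Example values only: documentation-range addresses, not real ones.
inventory = {"leaf1": "192.0.2.11", "leaf3": "10.20.0.3"}
offenders = hosts_outside_oob(inventory, "192.0.2.0/24")
```

Failing the job when `offenders` is non-empty means no device is deployed to over a path that the change itself could sever.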

What Changes Culturally

The technology is the easy part. The shift is that “I’ll just fix it real quick on the box” becomes a merge request. That feels slower for a one-line change — and it is, by a few minutes. But the one-line change that withdraws the wrong prefix and blackholes a /16 is exactly the change that “felt too small to review.” Put every change through the pipeline, including the small ones, because the small ones are the ones that cause the outages nobody saw coming.