The way most network changes still happen: open a ticket, SSH into the box at a maintenance window, paste config, hope. There is no review, no test, and “rollback” means frantically retyping from memory while the link is down. We solved this for application code a decade ago with CI/CD. The same pipeline works for network config — and the payoff is even bigger, because a bad route affects everyone, not one service.
The model: config lives in git, changes go through merge requests, a pipeline validates them offline, and deployment is a gated, diffed, reversible step.
The Pipeline Shape
```
commit -> MR
   │
 render     generate intended config from source-of-truth data
   │
 validate   Batfish: offline correctness checks, no device touched
   │
 diff       NAPALM dry-run: show what would change on each device
   │
 (manual approval gate)
 deploy     commit the change, capture pre-change backup
   │
 verify     reachability / protocol state; rollback on failure
```

The crucial idea is that everything before deploy changes no device. Render and validate never connect to one, and the diff stage loads a candidate only to compare it and discard it. You catch the broken ACL, the typo’d next-hop, the accidentally-withdrawn prefix, in CI, not on the production router.
Offline Validation with Batfish
Batfish builds a model of your network from the config text and answers questions about it without ever connecting to a device. It catches the failures that a syntax check misses — unreachable ACL lines, BGP sessions that will not come up, routes that will not resolve.
```python
from pybatfish.client.session import Session
from pybatfish.datamodel.flow import HeaderConstraints, PathConstraints

bf = Session(host="batfish")
bf.init_snapshot("./configs", name="candidate", overwrite=True)

# Will any BGP session fail to establish?
bf.q.bgpSessionStatus().answer().frame()

# Are there ACL lines that can never match?
bf.q.filterLineReachability().answer().frame()

# Does a critical flow still get through after this change?
bf.q.reachability(
    pathConstraints=PathConstraints(startLocation="leaf1"),
    headers=HeaderConstraints(dstIps="10.20.0.0/24", dstPorts="443"),
    actions="SUCCESS",
).answer().frame()
```

A reachability test that flips from SUCCESS to FAIL between the current and candidate snapshots is a change that breaks production traffic, and you found it in a pipeline, before touching anything.
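A gate script such as batfish_checks.py has to reduce those answers to pass or fail. The sketch below is one possible shape, not the actual script, which the article does not show. It works on plain row dictionaries (roughly what `frame.to_dict("records")` yields) so the logic can run without a Batfish server. The column names `Established_Status`, `Unreachable_Line`, and `Sources` are what I understand pybatfish's answer frames to use; verify them against your version. The function names and sample rows are hypothetical.

```python
def bgp_failures(rows):
    """Return BGP session rows whose status is anything other than ESTABLISHED."""
    return [r for r in rows if r.get("Established_Status") != "ESTABLISHED"]


def unreachable_acl_lines(rows):
    """Return (sources, line) pairs for ACL lines that can never match."""
    return [(r.get("Sources"), r.get("Unreachable_Line")) for r in rows]


def gate(bgp_rows, acl_rows):
    """Collect human-readable problems; an empty list means the gate passes."""
    problems = []
    for r in bgp_failures(bgp_rows):
        problems.append(
            f"BGP {r.get('Node')} -> {r.get('Remote_IP')}: {r.get('Established_Status')}"
        )
    for sources, line in unreachable_acl_lines(acl_rows):
        problems.append(f"ACL {sources}: unreachable line '{line}'")
    return problems


# Illustrative rows only; real rows would come from .answer().frame().
sample_bgp = [
    {"Node": "leaf1", "Remote_IP": "10.0.0.1", "Established_Status": "ESTABLISHED"},
    {"Node": "leaf3", "Remote_IP": "10.0.0.9", "Established_Status": "NOT_COMPATIBLE"},
]
sample_acl = [
    {"Sources": ["leaf1: EDGE-IN"], "Unreachable_Line": "permit tcp any any eq 443"},
]

problems = gate(sample_bgp, sample_acl)
```

A CI wrapper would print `problems` and exit non-zero when the list is not empty. Failing on every non-established session may be too strict if some peers sit outside the snapshot, since the model cannot show those as established.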
The GitLab Pipeline
```yaml
stages: [render, validate, diff, deploy, verify]

render:
  stage: render
  script:
    - python render_from_netbox.py --out configs/
  artifacts:
    paths: [configs/]

validate:
  stage: validate
  script:
    - python batfish_checks.py configs/   # fails job on any check regression

diff:
  stage: diff
  script:
    - python napalm_diff.py configs/      # dry-run, writes per-device diffs to diffs/
  artifacts:
    paths: [diffs/]

deploy:
  stage: deploy
  when: manual            # human approval gate
  environment: production
  script:
    - python napalm_deploy.py configs/ --backup-dir backups/
  artifacts:
    paths: [backups/]     # verify runs in a fresh workspace and needs these to roll back

verify:
  stage: verify
  needs: [deploy]
  script:
    - python verify_state.py || { python rollback.py backups/; exit 1; }
```

when: manual on deploy is the gate. The diffs are an artifact on the merge request, so a reviewer approves the actual change, not a description of it. Nothing reaches a device until a human clicks deploy with the diff in front of them. Two details keep rollback honest: deploy publishes backups/ as an artifact, because each job starts from a clean workspace, and verify still exits non-zero after rolling back, so a reverted change never shows up as a green pipeline.
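The article names napalm_diff.py but does not show it. As a rough sketch of what the dry-run step might do, the function below loads a candidate, records the diff, and always discards it. It takes an already-constructed device object so it can be exercised with a stand-in; `load_replace_candidate`, `compare_config`, and `discard_config` are real NAPALM driver methods, while `write_diff`, `FakeDevice`, and the `.diff` file naming are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def write_diff(device, host, candidate, out_dir):
    """Load a candidate, save the diff to out_dir/<host>.diff, and discard it."""
    device.open()
    try:
        device.load_replace_candidate(config=candidate)
        try:
            diff = device.compare_config()
        finally:
            device.discard_config()  # dry run: this stage never commits
    finally:
        device.close()
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / f"{host}.diff").write_text(diff or "(no changes)\n")
    return diff


class FakeDevice:
    """Stand-in with NAPALM's method names so the sketch runs without hardware."""

    def __init__(self, diff):
        self.diff = diff
        self.calls = []

    def open(self):
        self.calls.append("open")

    def load_replace_candidate(self, config):
        self.calls.append("load")

    def compare_config(self):
        self.calls.append("compare")
        return self.diff

    def discard_config(self):
        self.calls.append("discard")

    def close(self):
        self.calls.append("close")


out_dir = tempfile.mkdtemp()
dev = FakeDevice("+ ip route 10.20.0.0/24 10.0.0.1\n")
diff_text = write_diff(dev, "leaf1", "hostname leaf1\n", out_dir)
saved = (Path(out_dir) / "leaf1.diff").read_text()
```

With a real driver you would pass `get_network_driver(name)(host, user, password)` in place of `FakeDevice`. Writing one file per device keeps the merge request artifact easy to scan.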
Deploy and Backup Together
The deploy step’s first job is to capture the running config so rollback is real, not theoretical:
```python
from napalm import get_network_driver

# USER and PW are assumed to be defined elsewhere, e.g. read from CI variables.


def deploy(host, driver, candidate, backup_dir):
    d = get_network_driver(driver)(host, USER, PW)
    d.open()
    try:
        # backup BEFORE changing anything
        running = d.get_config()["running"]
        with open(f"{backup_dir}/{host}.cfg", "w") as f:
            f.write(running)
        # load, diff once more, commit
        d.load_replace_candidate(config=candidate)
        diff = d.compare_config()
        if diff:
            d.commit_config()
        else:
            d.discard_config()  # don't leave a candidate locked on the device
    finally:
        d.close()  # release the session even if a step above raised
```

load_replace_candidate (full replace) plus a captured backup means rollback is “load the backup and commit”: deterministic, not a human retyping under pressure.
Verify, and Roll Back on Failure
Deployment is not done when the commit succeeds. It is done when the network still works. The verify stage proves it:
```python
# verify_state.py: exit non-zero to trigger rollback
import sys

# The three check helpers are assumed to be defined elsewhere and to return True or False.
checks = [
    bgp_neighbors_established(expected=12),
    interface_up("Ethernet1/1"),
    ping("10.20.0.1", count=5, loss_max=0),
]
sys.exit(0 if all(checks) else 1)
```

If verify exits non-zero, the pipeline runs rollback.py, which reloads the backups captured in the deploy step. The window between “bad change committed” and “previous config restored” is the length of one pipeline stage, automatically, with no one paging the on-call to remember what the config used to be.
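The check helpers themselves are not shown in the article. One way to build them, sketched here under assumptions, is as pure functions over the dictionaries that NAPALM's `get_bgp_neighbors()`, `get_interfaces()`, and `ping()` return, which makes them testable without a device. The function names below are hypothetical and deliberately differ from the article's helpers, whose signatures are unknown; the sample data is made up.

```python
def count_established(bgp_neighbors):
    """Count peers that are up across all VRFs in get_bgp_neighbors() output."""
    return sum(
        1
        for vrf in bgp_neighbors.values()
        for peer in vrf.get("peers", {}).values()
        if peer.get("is_up")
    )


def interface_is_up(interfaces, name):
    """True only if the interface exists and is both enabled and up."""
    intf = interfaces.get(name, {})
    return bool(intf.get("is_enabled") and intf.get("is_up"))


def ping_ok(result, loss_max=0):
    """True if the ping ran and lost no more than loss_max packets."""
    success = result.get("success")
    if success is None:  # NAPALM reports failures under an "error" key instead
        return False
    return success.get("packet_loss", float("inf")) <= loss_max


# Illustrative getter output; real data would come from an open NAPALM session.
bgp = {
    "global": {
        "router_id": "10.0.0.3",
        "peers": {
            "10.0.0.1": {"is_up": True, "is_enabled": True},
            "10.0.0.9": {"is_up": False, "is_enabled": True},
        },
    }
}
interfaces = {"Ethernet1/1": {"is_up": True, "is_enabled": True}}
ping_result = {"success": {"probes_sent": 5, "packet_loss": 0}}

checks = [
    count_established(bgp) == 2,  # expecting both peers up
    interface_is_up(interfaces, "Ethernet1/1"),
    ping_ok(ping_result),
]
exit_code = 0 if all(checks) else 1
```

Here one of the two peers is down, so the BGP check fails and `exit_code` is 1, which is the signal the pipeline would use to roll back.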
Junos Makes Part of This Free
On platforms with native candidate config and confirmed commit, lean on it as a second safety net:
```
commit confirmed 5
```

If the pipeline’s verify stage does not issue a final commit within five minutes, the device rolls back on its own. Even if your CI runner dies mid-deploy, the router un-breaks itself. Use both: pipeline rollback for the general case, commit confirmed for the “the automation itself failed” case.
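Recent NAPALM releases expose the same idea through `commit_config(revert_in=...)` and `confirm_commit()` on drivers that support it. The sketch below shows how a deploy step might use them; treat the driver support and the seconds unit for `revert_in` as things to confirm in the NAPALM documentation for your platform. `deploy_with_confirm`, `FakeDevice`, and the return strings are hypothetical.

```python
def deploy_with_confirm(device, candidate, verify, revert_in=300):
    """Commit with an automatic revert timer; confirm only if verify() passes."""
    device.load_replace_candidate(config=candidate)
    if not device.compare_config():
        device.discard_config()
        return "unchanged"
    # The device reverts on its own unless the commit is confirmed in time.
    device.commit_config(revert_in=revert_in)
    if verify():
        device.confirm_commit()  # cancels the pending revert
        return "confirmed"
    # Do nothing further: the timer restores the previous config.
    return "left_to_revert"


class FakeDevice:
    """Stand-in with NAPALM's method names so the sketch runs without hardware."""

    def __init__(self, diff):
        self.diff = diff
        self.calls = []

    def load_replace_candidate(self, config):
        self.calls.append("load")

    def compare_config(self):
        return self.diff

    def discard_config(self):
        self.calls.append("discard")

    def commit_config(self, message="", revert_in=None):
        self.calls.append(("commit", revert_in))

    def confirm_commit(self):
        self.calls.append("confirm")


good = FakeDevice("+ set system host-name leaf3")
bad = FakeDevice("+ set system host-name leaf3")
result_good = deploy_with_confirm(good, "cfg", verify=lambda: True)
result_bad = deploy_with_confirm(bad, "cfg", verify=lambda: False)
```

The failing case never calls `confirm_commit()`, so if the runner crashes at that point the outcome is the same as a failed verify: the timer expires and the device reverts.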
Comparing Snapshots, Not Just Checking One
A single Batfish snapshot answers “is this config valid?” The more useful question is “did this change break something that worked before?” Batfish models a reference snapshot (current production) and a candidate snapshot (the MR) side by side, and the diff of their answers is the signal:
```python
from pybatfish.client.session import Session
from pybatfish.datamodel.flow import HeaderConstraints

bf = Session(host="batfish")
bf.init_snapshot("./configs-current", name="reference", overwrite=True)
bf.init_snapshot("./configs-candidate", name="candidate", overwrite=True)

# Sessions that are up in reference but down in candidate = a regression this MR introduces
ref = bf.q.bgpSessionStatus().answer(snapshot="reference").frame()
cand = bf.q.bgpSessionStatus().answer(snapshot="candidate").frame()

# Flows whose outcome differs between the two snapshots
bf.q.differentialReachability(
    headers=HeaderConstraints(dstIps="10.20.0.0/24"),
).answer(snapshot="candidate", reference_snapshot="reference").frame()
```

differentialReachability is the one to wire into the gate: it returns exactly the flows whose fate changed between the two snapshots. An empty result means this MR does not alter who can reach what. A non-empty result is the list of conversations the reviewer must consciously sign off on. That reframes review from reading config lines to approving behavioral deltas.
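The `ref` and `cand` frames still need to be compared to find the regressions. A minimal sketch of that comparison follows, assuming the rows have been converted to dictionaries and that `Node`, `VRF`, `Remote_IP`, and `Established_Status` are the relevant column names (check these against your pybatfish version). `session_key` and `bgp_regressions` are hypothetical helpers, and the rows are made up.

```python
def session_key(row):
    """Identify a BGP session by local node, VRF, and remote address."""
    return (row["Node"], row["VRF"], str(row["Remote_IP"]))


def bgp_regressions(ref_rows, cand_rows):
    """Sessions ESTABLISHED in reference that are missing or not ESTABLISHED in candidate."""
    cand_status = {session_key(r): r["Established_Status"] for r in cand_rows}
    return sorted(
        session_key(r)
        for r in ref_rows
        if r["Established_Status"] == "ESTABLISHED"
        and cand_status.get(session_key(r)) != "ESTABLISHED"
    )


# Illustrative rows only.
ref_rows = [
    {"Node": "leaf1", "VRF": "default", "Remote_IP": "10.0.0.1", "Established_Status": "ESTABLISHED"},
    {"Node": "leaf3", "VRF": "default", "Remote_IP": "10.0.0.9", "Established_Status": "ESTABLISHED"},
    {"Node": "leaf4", "VRF": "default", "Remote_IP": "10.0.0.7", "Established_Status": "NOT_ESTABLISHED"},
]
cand_rows = [
    {"Node": "leaf1", "VRF": "default", "Remote_IP": "10.0.0.1", "Established_Status": "ESTABLISHED"},
    {"Node": "leaf3", "VRF": "default", "Remote_IP": "10.0.0.9", "Established_Status": "NOT_COMPATIBLE"},
    {"Node": "leaf4", "VRF": "default", "Remote_IP": "10.0.0.7", "Established_Status": "NOT_ESTABLISHED"},
]

regressions = bgp_regressions(ref_rows, cand_rows)
```

The leaf4 session was already down in the reference snapshot, so it is not reported. Only sessions this change would take down appear in the list.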
Failure Drill: The Verify Stage Caught a Bad Commit
Walk the unhappy path end to end, because the unhappy path is the whole reason to build this. An MR retags a transit interface into the wrong VRF. Batfish passes — the config is syntactically valid and the VRF exists. NAPALM’s diff looks small and plausible. A reviewer clicks deploy.
```
deploy: config committed on leaf3, backup saved to backups/leaf3.cfg
verify: bgp_neighbors_established(expected=12) -> got 10  FAIL
```

Two sessions across that interface dropped because the neighbor is now in a different VRF. verify_state.py exits non-zero, the pipeline runs rollback.py, and the captured backup reloads:
```python
from napalm import get_network_driver
import glob
import os

# BACKUP_DIR, USER, and PW are assumed to be defined elsewhere.
# "ios" is hard-coded here; use the NAPALM driver that matches your platform.
for path in glob.glob(f"{BACKUP_DIR}/*.cfg"):
    host = os.path.basename(path).removesuffix(".cfg")  # Python 3.9+
    d = get_network_driver("ios")(host, USER, PW)
    d.open()
    try:
        d.load_replace_candidate(filename=path)  # full replace back to pre-change
        if d.compare_config():
            d.commit_config()
        else:
            d.discard_config()  # nothing to restore on this device
    finally:
        d.close()
```

Total exposure is one verify run plus one rollback run, call it two minutes, and no human had to remember what leaf3 looked like before. Contrast the SSH-and-paste version: the same mistake is discovered when a service team opens a ticket an hour later, and the fix is an engineer reconstructing the old config from memory.
One real gotcha: the verify stage must connect through a path the change cannot sever. If your runner reaches leaf3 only via the link the MR just broke, verify times out for the wrong reason and rollback may not reach the device either. Verify and roll back over out-of-band management, never the production data path.
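One way to enforce that rule is a preflight guard that refuses to run verify or rollback against any address outside the management network. The sketch below uses only the standard library; the prefix `192.0.2.0/24` and the host-to-address mapping are hypothetical placeholders, not values from this article.

```python
import ipaddress


def is_out_of_band(address, oob_prefixes):
    """True if address falls inside any of the out-of-band management prefixes."""
    addr = ipaddress.ip_address(address)
    return any(addr in ipaddress.ip_network(prefix) for prefix in oob_prefixes)


# Hypothetical management prefix and connection targets.
OOB_PREFIXES = ["192.0.2.0/24"]
targets = {"leaf3": "192.0.2.13", "leaf4": "10.20.0.4"}

# Hosts the pipeline would reach over the production data path.
unsafe = [
    host
    for host, address in sorted(targets.items())
    if not is_out_of_band(address, OOB_PREFIXES)
]
```

A pipeline could fail early when `unsafe` is not empty, before any change is committed, rather than discovering during rollback that the device is unreachable.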
What Changes Culturally
The technology is the easy part. The shift is that “I’ll just fix it real quick on the box” becomes a merge request. That feels slower for a one-line change — and it is, by a few minutes. But the one-line change that withdraws the wrong prefix and blackholes a /16 is exactly the change that “felt too small to review.” Put every change through the pipeline, including the small ones, because the small ones are the ones that cause the outages nobody saw coming.