CI/CD for Network Configs: Test and Rollback with GitLab

April 28, 2026 · 7 min read

The way most network changes still happen: open a ticket, SSH into the box at a maintenance window, paste config, hope. There is no review, no test, and “rollback” means frantically retyping from memory while the link is down. We solved this for application code a decade ago with CI/CD. The same pipeline works for network config — and the payoff is even bigger, because a bad route affects everyone, not one service.

The model: config lives in git, changes go through merge requests, a pipeline validates them offline, and deployment is a gated, diffed, reversible step.

The Pipeline Shape

  commit -> MR
     │
   render      generate intended config from source-of-truth data
     │
   validate    Batfish: offline correctness checks, no device touched
     │
   diff        NAPALM dry-run: show what would change on each device
     │  (manual approval gate)
   deploy      capture pre-change backup, then commit the change
     │
   verify      reachability / protocol state; rollback on failure

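The crucial idea is that nothing before deploy changes a device. Render and validate never connect to one; the diff stage connects only to load a candidate, compare it against the running config, and discard it. You catch the broken ACL, the typo’d next-hop, the accidentally-withdrawn prefix in CI, not on the production router.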

Offline Validation with Batfish

Batfish builds a model of your network from the config text and answers questions about it without ever connecting to a device. It catches the failures that a syntax check misses — unreachable ACL lines, BGP sessions that will not come up, routes that will not resolve.

from pybatfish.client.session import Session
from pybatfish.datamodel.flow import HeaderConstraints, PathConstraints

bf = Session(host="batfish")
bf.init_snapshot("./configs", name="candidate", overwrite=True)

# Will any BGP session fail to establish?
bf.q.bgpSessionStatus().answer().frame()

# Are there ACL lines that can never match?
bf.q.filterLineReachability().answer().frame()

# Does a critical flow still get through after this change?
bf.q.reachability(
    pathConstraints=PathConstraints(startLocation="leaf1"),
    headers=HeaderConstraints(dstIps="10.20.0.0/24", dstPorts="443"),
    actions="SUCCESS",
).answer().frame()

A reachability test that flips from SUCCESS to FAIL between the current and candidate snapshots is a change that breaks production traffic — and you found it in a pipeline, before touching anything.
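One way to turn those answers into a pass/fail gate is to reduce each frame to plain rows and list the offenders. The sketch below is a guess at what a script like batfish_checks.py might do, not its actual contents. gate_failures is a hypothetical helper, the rows stand in for what frame.to_dict("records") would give you, and the column names (Node, Remote_IP, Established_Status, Sources, Unreachable_Line) are assumptions based on the Batfish question docs, so check them against your pybatfish version.

```python
def gate_failures(bgp_rows, acl_rows):
    """Return human-readable failures; an empty list means the gate passes."""
    failures = []
    for row in bgp_rows:
        # Assumption: any status other than ESTABLISHED should fail the gate.
        if row.get("Established_Status") != "ESTABLISHED":
            failures.append(
                f"BGP {row.get('Node')} -> {row.get('Remote_IP')}: "
                f"{row.get('Established_Status')}"
            )
    for row in acl_rows:
        # Assumption: every row in this answer describes a line that cannot match.
        failures.append(
            f"ACL {row.get('Sources')}: unreachable line "
            f"{row.get('Unreachable_Line')!r}"
        )
    return failures


# Hand-written sample rows standing in for real Batfish answers.
sample_bgp = [
    {"Node": "leaf1", "Remote_IP": "10.0.0.1", "Established_Status": "ESTABLISHED"},
    {"Node": "leaf2", "Remote_IP": "10.0.0.2", "Established_Status": "NOT_ESTABLISHED"},
]
sample_acl = [
    {"Sources": "leaf1: EDGE_IN", "Unreachable_Line": "permit tcp any any eq 443"},
]

problems = gate_failures(sample_bgp, sample_acl)
for line in problems:
    print(line)
```

In a CI job, the script would finish by exiting non-zero whenever the list is non-empty, which is what fails the validate stage.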

The GitLab Pipeline

stages: [render, validate, diff, deploy, verify]

render:
  stage: render
  script:
    - python render_from_netbox.py --out configs/
  artifacts:
    paths: [configs/]

validate:
  stage: validate
  script:
    - python batfish_checks.py configs/   # fails job on any check regression

diff:
  stage: diff
  script:
    - python napalm_diff.py configs/       # dry-run, prints per-device diff
  artifacts:
    paths: [diffs/]

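deploy:
  stage: deploy
  when: manual                              # human approval gate
  environment: production
  script:
    - python napalm_deploy.py configs/ --backup-dir backups/
  artifacts:
    when: always                            # keep backups even if deploy fails partway
    paths: [backups/]                       # verify runs as a separate job and needs these

verify:
  stage: verify
  needs: [deploy]                           # also downloads the backups/ artifact
  script:
    # roll back on failure, then fail the job so the pipeline goes red
    - python verify_state.py || { python rollback.py backups/; exit 1; }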

when: manual on deploy is the gate. The diffs are an artifact on the merge request, so a reviewer approves the actual change, not a description of it. Nothing reaches a device until a human clicks deploy with the diff in front of them.
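The diff stage itself can be small. Here is a minimal sketch of the dry-run idea, assuming a device object that exposes NAPALM's load_replace_candidate, compare_config and discard_config. dry_run_diff is a hypothetical name, and FakeDevice is a stand-in so the example runs without hardware; it is not how any real driver computes its diff.

```python
import difflib


def dry_run_diff(device, candidate):
    """Load a candidate, return the diff, and always discard the candidate."""
    device.open()
    try:
        try:
            device.load_replace_candidate(config=candidate)
            return device.compare_config()
        finally:
            # Never leave a candidate staged on the device after a dry run.
            device.discard_config()
    finally:
        device.close()


class FakeDevice:
    """Stand-in for a NAPALM driver, used only to exercise the sketch."""

    def __init__(self, running):
        self.running = running
        self.candidate = None
        self.calls = []

    def open(self):
        self.calls.append("open")

    def close(self):
        self.calls.append("close")

    def load_replace_candidate(self, config):
        self.calls.append("load")
        self.candidate = config

    def compare_config(self):
        self.calls.append("compare")
        return "\n".join(
            difflib.unified_diff(
                self.running.splitlines(),
                self.candidate.splitlines(),
                lineterm="",
            )
        )

    def discard_config(self):
        self.calls.append("discard")
        self.candidate = None


device = FakeDevice("hostname leaf1\n")
diff_text = dry_run_diff(device, "hostname leaf1\nntp server 10.0.0.1\n")
print(diff_text)
```

The nested try/finally is the design choice worth keeping: the candidate is discarded and the session closed even if the load or the compare raises.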

Deploy and Backup Together

The deploy step’s first job is to capture the running config so rollback is real, not theoretical:

import os
from napalm import get_network_driver

# USER and PW are assumed to be defined elsewhere, e.g. read from CI variables.

def deploy(host, driver, candidate, backup_dir):
    d = get_network_driver(driver)(host, USER, PW)
    d.open()
    try:
        # backup BEFORE changing anything
        running = d.get_config(retrieve="running")["running"]
        os.makedirs(backup_dir, exist_ok=True)
        with open(os.path.join(backup_dir, f"{host}.cfg"), "w") as f:
            f.write(running)
        # load, diff once more, commit
        d.load_replace_candidate(config=candidate)
        if d.compare_config():
            d.commit_config()
        else:
            d.discard_config()  # don't leave a candidate locked on the device
    finally:
        d.close()               # always release the session, even on error

load_replace_candidate (full replace) plus a captured backup means rollback is “load the backup and commit” — deterministic, not a human retyping under pressure.

Verify, and Roll Back on Failure

Deployment is not done when the commit succeeds. It is done when the network still works. The verify stage proves it:

# verify_state.py: exit non-zero to trigger rollback
import sys

# The check helpers are assumed to be defined elsewhere and to return True/False.
checks = [
    bgp_neighbors_established(expected=12),
    interface_up("Ethernet1/1"),
    ping("10.20.0.1", count=5, loss_max=0),
]
sys.exit(0 if all(checks) else 1)

If verify exits non-zero, the pipeline runs rollback.py, which reloads the backups captured in the deploy step. The window between “bad change committed” and “previous config restored” shrinks to the runtime of one pipeline stage, and it closes automatically, without paging the on-call to remember what the config used to be.
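The check helpers are where the real judgment lives. As a rough sketch, assuming the device state was fetched with NAPALM's get_bgp_neighbors() and get_interfaces(), the two state checks could be pure functions over those dictionaries. The signatures here differ from the ones in verify_state.py (they take the fetched data as their first argument) and the sample data is invented for illustration.

```python
def bgp_neighbors_established(bgp, expected):
    """True if at least `expected` peers are up across all VRFs.

    `bgp` is assumed to be shaped like NAPALM get_bgp_neighbors() output:
    {vrf: {"peers": {peer_ip: {"is_up": bool, ...}}}}
    """
    up = sum(
        1
        for vrf in bgp.values()
        for peer in vrf.get("peers", {}).values()
        if peer.get("is_up")
    )
    return up >= expected


def interface_up(interfaces, name):
    """True if the named interface exists, is enabled, and is up.

    `interfaces` is assumed to be shaped like NAPALM get_interfaces() output.
    """
    intf = interfaces.get(name, {})
    return bool(intf.get("is_enabled") and intf.get("is_up"))


# Invented sample state: two peers up, one down, one interface up.
sample_bgp = {
    "global": {
        "peers": {
            "10.0.0.1": {"is_up": True},
            "10.0.0.2": {"is_up": True},
            "10.0.0.3": {"is_up": False},
        }
    }
}
sample_interfaces = {"Ethernet1/1": {"is_enabled": True, "is_up": True}}

checks = [
    bgp_neighbors_established(sample_bgp, expected=2),
    interface_up(sample_interfaces, "Ethernet1/1"),
]
print(all(checks))
```

Keeping the checks free of I/O makes them easy to unit-test with captured device output, separate from the code that connects to the device.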

Junos Makes Part of This Free

On platforms with native candidate config and confirmed commit, lean on it as a second safety net:

commit confirmed 5

If the pipeline’s verify stage does not issue a final commit within five minutes, the device rolls back on its own. Even if your CI runner dies mid-deploy, the router un-breaks itself. Use both — pipeline rollback for the general case, commit confirmed for the “the automation itself failed” case.
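If you deploy through NAPALM rather than the CLI, the same pattern can be driven from the pipeline. The sketch below assumes a driver that supports commit_config(revert_in=...), confirm_commit() and rollback(), which recent NAPALM releases expose on some platforms; support varies by driver and version, so treat those calls as something to verify. deploy_with_confirm is a hypothetical helper and FakeJunos is a toy stand-in, not real device behavior.

```python
def deploy_with_confirm(device, candidate, verify, revert_in=300):
    """Commit with an automatic revert timer, confirm only if verify passes."""
    device.load_replace_candidate(config=candidate)
    if not device.compare_config():
        device.discard_config()
        return "unchanged"
    # From here on, the device reverts by itself unless we confirm in time.
    device.commit_config(revert_in=revert_in)
    if verify():
        device.confirm_commit()
        return "confirmed"
    # Verification failed: revert now instead of waiting for the timer.
    device.rollback()
    return "rolled_back"


class FakeJunos:
    """Toy stand-in for a NAPALM driver, used only to exercise the sketch."""

    def __init__(self, running):
        self.running = running
        self.candidate = None
        self.previous = None   # config to fall back to
        self.pending = False   # True while a confirmed commit awaits confirmation

    def load_replace_candidate(self, config):
        self.candidate = config

    def compare_config(self):
        return "" if self.candidate == self.running else "changed"

    def discard_config(self):
        self.candidate = None

    def commit_config(self, revert_in=None):
        self.previous, self.running = self.running, self.candidate
        self.candidate = None
        self.pending = revert_in is not None

    def confirm_commit(self):
        self.pending = False

    def rollback(self):
        self.running, self.pending = self.previous, False


good = FakeJunos("old config")
good_result = deploy_with_confirm(good, "new config", verify=lambda: True)

bad = FakeJunos("old config")
bad_result = deploy_with_confirm(bad, "new config", verify=lambda: False)
print(good_result, bad_result)
```

If the runner dies between the commit and the verify call, neither confirm nor rollback is sent, and the revert timer on the device is what restores the old config.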

Comparing Snapshots, Not Just Checking One

A single Batfish snapshot answers “is this config valid?” The more useful question is “did this change break something that worked before?” Batfish models a reference snapshot (current production) and a candidate snapshot (the MR) side by side, and the diff of their answers is the signal:

from pybatfish.client.session import Session
from pybatfish.datamodel.flow import HeaderConstraints

bf = Session(host="batfish")
bf.init_snapshot("./configs-current", name="reference", overwrite=True)
bf.init_snapshot("./configs-candidate", name="candidate", overwrite=True)

# Compare these two frames: a session that is up in reference but down in
# candidate is a regression this MR introduces
ref = bf.q.bgpSessionStatus().answer(snapshot="reference").frame()
cand = bf.q.bgpSessionStatus().answer(snapshot="candidate").frame()

# Flows that are delivered now but won't be after the change
bf.q.differentialReachability(
    headers=HeaderConstraints(dstIps="10.20.0.0/24"),
).answer(snapshot="candidate", reference_snapshot="reference").frame()

differentialReachability is the one to wire into the gate: it returns flows whose fate differs between the two snapshots. An empty result means that, within the headers you constrained, this MR does not alter who can reach what. A non-empty result is the list of conversations the reviewer must consciously sign off on. That reframes review from reading config lines to approving behavioral deltas.
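The BGP comparison can be made just as explicit. A minimal sketch, assuming the two bgpSessionStatus frames have been converted to rows with frame.to_dict("records") and that each row carries Node, VRF, Remote_IP and Established_Status columns (column names are an assumption to check against your pybatfish version). bgp_regressions is a hypothetical helper and the rows are invented.

```python
def bgp_regressions(ref_rows, cand_rows):
    """Sessions established in the reference snapshot but not in the candidate."""

    def established(rows):
        return {
            (r["Node"], r["VRF"], r["Remote_IP"])
            for r in rows
            if r["Established_Status"] == "ESTABLISHED"
        }

    return sorted(established(ref_rows) - established(cand_rows))


# Invented rows: leaf3's session to 10.0.3.1 drops in the candidate.
ref_rows = [
    {"Node": "leaf1", "VRF": "default", "Remote_IP": "10.0.1.1",
     "Established_Status": "ESTABLISHED"},
    {"Node": "leaf3", "VRF": "default", "Remote_IP": "10.0.3.1",
     "Established_Status": "ESTABLISHED"},
]
cand_rows = [
    {"Node": "leaf1", "VRF": "default", "Remote_IP": "10.0.1.1",
     "Established_Status": "ESTABLISHED"},
    {"Node": "leaf3", "VRF": "default", "Remote_IP": "10.0.3.1",
     "Established_Status": "NOT_ESTABLISHED"},
]

regressions = bgp_regressions(ref_rows, cand_rows)
print(regressions)
```

Because the VRF is part of the key, a session that moves to a different VRF also shows up as a regression, which is usually what you want a reviewer to see.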

Failure Drill: The Verify Stage Caught a Bad Commit

Walk the unhappy path end to end, because the unhappy path is the whole reason to build this. An MR retags a transit interface into the wrong VRF. Batfish passes — the config is syntactically valid and the VRF exists. NAPALM’s diff looks small and plausible. A reviewer clicks deploy.

deploy:  config committed on leaf3, backup saved to backups/leaf3.cfg
verify:  bgp_neighbors_established(expected=12) -> got 10   FAIL

Two sessions across that interface dropped because the neighbor is now in a different VRF. verify_state.py exits non-zero, the pipeline runs rollback.py, and the captured backup reloads:

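from napalm import get_network_driver
import glob, os, sys

# USER and PW are assumed to be defined elsewhere; "ios" is an example driver.
BACKUP_DIR = sys.argv[1]    # the pipeline calls: rollback.py backups/

for path in glob.glob(os.path.join(BACKUP_DIR, "*.cfg")):
    host = os.path.basename(path).removesuffix(".cfg")
    d = get_network_driver("ios")(host, USER, PW)
    d.open()
    try:
        d.load_replace_candidate(filename=path)   # full replace back to pre-change
        if d.compare_config():
            d.commit_config()
        else:
            d.discard_config()                    # nothing to restore on this device
    finally:
        d.close()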

Total exposure is one verify run plus one rollback run, call it two minutes, and no human had to remember what leaf3 looked like before. Contrast the SSH-and-paste version: the same mistake is discovered when a service team opens a ticket an hour later, and the fix is an engineer reconstructing the old config from memory.

One real gotcha: the verify stage must connect through a path the change cannot sever. If your runner reaches leaf3 only via the link the MR just broke, verify times out for the wrong reason and rollback may not reach the device either. Verify and roll back over out-of-band management, never the production data path.

What Changes Culturally

The technology is the easy part. The shift is that “I’ll just fix it real quick on the box” becomes a merge request. That feels slower for a one-line change — and it is, by a few minutes. But the one-line change that withdraws the wrong prefix and blackholes a /16 is exactly the change that “felt too small to review.” Put every change through the pipeline, including the small ones, because the small ones are the ones that cause the outages nobody saw coming.