SNMP has monitored networks for thirty years, and the model has not changed: a poller wakes up every 30 or 60 seconds, asks each device for a pile of OIDs, and hopes nothing interesting happened in between. Microbursts, transient drops, a flap that healed in ten seconds — invisible. You polled the gaps.
Streaming telemetry inverts it. The device pushes data on change or on a tight interval, structured by a data model. gNMI is the gRPC-based protocol that does this, and once you have it running, the 60-second blind spot disappears.
Why Streaming Beats Polling
| | SNMP poll | gNMI stream |
|---|---|---|
| Direction | Collector pulls | Device pushes |
| Granularity | Poll interval (30–60s) | Sub-second / on-change |
| Data model | MIBs/OIDs | OpenConfig/YANG paths |
| Transport | UDP, often plaintext | gRPC over TLS |
| Microburst visibility | None | Yes |
The deeper win is the data model. OpenConfig paths are vendor-neutral and human-readable: /interfaces/interface[name=Ethernet1]/state/counters/in-octets means the same thing on every platform that implements the model, which beats hunting for the right vendor OID.
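To make the structure of such a path concrete, here is a minimal Python sketch that splits a gNMI-style path string into element names and their keys. The function name parse_path is hypothetical, and the parser only handles the simple name[key=value] form used in this article; it assumes key values contain no escaped brackets and does not cover every corner of the gNMI path specification.

```python
import re

# One path element: a name followed by zero or more [key=value] selectors.
_ELEM = re.compile(r"([^/\[\]]+)((?:\[[^\]=]+=[^\]]*\])*)")
_KEY = re.compile(r"\[([^\]=]+)=([^\]]*)\]")


def parse_path(path: str) -> list:
    """Split '/a/b[k=v]/c' into [(name, {key: value}), ...].

    Simplified on purpose: assumes no escaped ']' inside key values.
    """
    elems = []
    pos = 0
    body = path.strip()
    while pos < len(body):
        if body[pos] == "/":
            pos += 1  # skip the separator between elements
            continue
        match = _ELEM.match(body, pos)
        if match is None:
            raise ValueError(f"cannot parse path at offset {pos}: {path!r}")
        keys = dict(_KEY.findall(match.group(2)))
        elems.append((match.group(1), keys))
        pos = match.end()
    return elems


if __name__ == "__main__":
    example = "/interfaces/interface[name=Ethernet1]/state/counters/in-octets"
    for name, keys in parse_path(example):
        print(name, keys)
```

The keys are what later turn into labels: the element names describe what is being measured, and the key values say which instance it was measured on.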
Subscriptions: ON_CHANGE vs SAMPLE
A gNMI stream subscription runs in one of two modes you pick yourself (a third, TARGET_DEFINED, leaves the choice to the device), and picking the right one per metric is the whole game:
- ON_CHANGE: the device sends an update only when the value changes. Perfect for state: interface oper-status, BGP session state, an alarm. You learn about a link going down the instant it happens, not at the next poll.
- SAMPLE: the device sends the current value at a fixed interval that you set. Right for counters and gauges: octets, errors, CPU, temperature.
Mixing them correctly is what keeps the data volume sane. Streaming every counter on-change would be a firehose; sampling oper-status would reintroduce the blind spot.
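The difference is easy to see in a toy simulation. The sketch below is not gNMI code; it models a hypothetical link that goes down for ten seconds and compares what a 60-second poller records with what an on-change stream would deliver. All function names and the flap timing are invented for illustration.

```python
def oper_status(t: int) -> str:
    """Toy link: DOWN from t=95s up to (not including) t=105s, UP otherwise."""
    return "DOWN" if 95 <= t < 105 else "UP"


def poll(interval: int, duration: int) -> list:
    """What a poller sees: one reading every `interval` seconds."""
    return [(t, oper_status(t)) for t in range(0, duration, interval)]


def on_change(duration: int) -> list:
    """What an on-change stream sends: the initial value, then one update per transition."""
    updates = [(0, oper_status(0))]
    for t in range(1, duration):
        if oper_status(t) != oper_status(t - 1):
            updates.append((t, oper_status(t)))
    return updates


polled = poll(interval=60, duration=300)
streamed = on_change(duration=300)

print("poller saw DOWN:", any(status == "DOWN" for _, status in polled))
print("on-change updates:", streamed)
```

The poller takes five readings and every one of them says UP. The stream sends three messages in total and two of them are the flap.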
gnmic: The Swiss-Army Client
gnmic is the practical tool for subscribing, testing paths, and shipping data. Start by confirming a path returns data:
    gnmic -a 10.0.0.11:57400 -u admin -p secret --skip-verify \
      get --path "/interfaces/interface[name=Ethernet1]/state/counters"

A subscription on the CLI, to eyeball the stream:

    gnmic -a 10.0.0.11:57400 -u admin -p secret --skip-verify \
      subscribe \
      --path "/interfaces/interface/state/counters" \
      --stream-mode sample --sample-interval 10s

For state, switch the mode:

    gnmic -a 10.0.0.11:57400 -u admin -p secret --skip-verify \
      subscribe \
      --path "/interfaces/interface/state/oper-status" \
      --stream-mode on-change

Landing Data in Prometheus
The part people dread — writing a collector — gnmic does for you. It has a built-in Prometheus output that exposes a /metrics endpoint Prometheus scrapes. Driven by a config file:
    targets:
      10.0.0.11:57400:
        username: admin
        password: secret
        skip-verify: true
      10.0.0.12:57400:
        username: admin
        password: secret
        skip-verify: true

    subscriptions:
      if-counters:
        paths:
          - "/interfaces/interface/state/counters"
        stream-mode: sample
        sample-interval: 10s
      if-state:
        paths:
          - "/interfaces/interface/state/oper-status"
          - "/network-instances/network-instance/protocols/protocol/bgp/neighbors/neighbor/state/session-state"
        stream-mode: on-change

    outputs:
      prom:
        type: prometheus
        listen: ":9804"

    processors: {}

Then start the collector:

    gnmic --config gnmic.yaml subscribe

Prometheus scrapes :9804, and the OpenConfig path becomes a labeled metric. No custom exporter, no SNMP translation layer.

    scrape_configs:
      - job_name: gnmic
        static_configs:
          - targets: ["gnmic-host:9804"]

Querying It
Because paths carry labels (interface name, device target), PromQL queries read naturally. Interface input rate in bits/sec:
    rate(interfaces_interface_state_counters_in_octets[1m]) * 8

A BGP session that left Established, flagged instantly because it streamed on-change:

    network_instance_..._bgp_neighbor_state_session_state != 6

Operational Reality
A few things that bite on the way in:
- TLS. gNMI is gRPC over TLS. --skip-verify is fine in a lab; in production deploy real certs. The transport is encrypted by default, a genuine improvement over SNMPv2 in clear text.
- Cardinality. Subscribing to every path on every interface on a large device generates a lot of series. Subscribe to what you will actually alert or graph on, not everything the model exposes.
- Vendor path drift. “OpenConfig” support varies. Some platforms expose native YANG paths that differ from upstream OpenConfig. Test paths with gnmic get before building dashboards on them.
Encoding and the Path Trap
Two choices quietly decide whether a subscription works at all: encoding and where the path is rooted.
gNMI updates are typically encoded as JSON, JSON_IETF, or PROTO. Vendors disagree on which they support and which they default to: request the wrong one and the device either errors or sends data your downstream cannot parse. Be explicit:
    gnmic -a 10.0.0.11:57400 -u admin -p secret --skip-verify \
      --encoding json_ietf \
      subscribe --path "/interfaces/interface/state/counters" \
      --stream-mode sample --sample-interval 10s

The subtler trap is the difference between /state and /config subtrees. /state is operational, what is actually happening, and that is what you want for telemetry. /config is intended config, which barely changes and is useless as a counter source. A common rollout mistake is subscribing to /interfaces/interface/state/counters and getting nothing because the platform roots counters under a native YANG path that only resembles OpenConfig. The fix is always the same: confirm the exact path with gnmic get against one device, paste the path that returned data into the subscription, and never hand-type a path from a model document and assume it maps.
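The /state versus /config mix-up is the kind of thing a small lint step can catch before a config ships. The following is a rough Python sketch under one assumption: that paths follow the OpenConfig convention of separate config and state containers. The helper names are hypothetical, and a result of "other" is only a prompt to verify the path by hand, since native vendor paths may not follow the convention at all.

```python
import re


def path_container(path: str) -> str:
    """Return 'state', 'config', or 'other' for the container a path passes through."""
    # Drop [key=value] selectors first so a '/' inside a key value
    # (for example Ethernet1/1) is not mistaken for a path separator.
    stripped = re.sub(r"\[[^\]]*\]", "", path)
    elems = [elem for elem in stripped.split("/") if elem]
    if "state" in elems:
        return "state"
    if "config" in elems:
        return "config"
    return "other"


def lint_subscription_paths(paths: list) -> list:
    """Return the paths that do not pass through a state container."""
    return [p for p in paths if path_container(p) != "state"]


if __name__ == "__main__":
    candidates = [
        "/interfaces/interface/state/counters",
        "/interfaces/interface[name=Ethernet1/1]/config/description",
    ]
    for suspicious in lint_subscription_paths(candidates):
        print("check before subscribing:", suspicious)
```

This does not replace the gnmic get check. It only narrows down which paths deserve a second look.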
Dial-In vs Dial-Out, and the Collector Question
The examples above are dial-in: gnmic opens the gRPC session to the device and subscribes. That is the simplest model and the right default. It breaks down at scale for one reason — gnmic holds a long-lived connection to every target, and a single gnmic process subscribing to a thousand devices is a bottleneck and a single point of failure.
Two ways out. The first is clustering: gnmic supports a clustered mode backed by a locker (Consul) so multiple instances share the target load and fail over:
    # gnmic.yaml: clustered collector
    clustering:
      cluster-name: telemetry
      locker:
        type: consul
        address: consul:8500
      targets-watch-timer: 30s

The second is dial-out, where the device initiates the connection to the collector (gNMI’s dial-out / gRPC tunnel, or vendor MDT). Dial-out survives NAT and firewalls better, because the device reaches out, but you give up gnmic’s clean per-target config and lean on the platform’s telemetry stanza instead. For a brownfield network behind firewalls, dial-out is often the only thing that works; for a flat, reachable fabric, dial-in with clustering is simpler to reason about.
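For intuition about what "sharing the target load" means in the clustered case, here is a generic Python sketch of one way to split targets across collector instances so that losing an instance only moves the targets it owned. This is rendezvous hashing, a general technique, and it is not a description of how gnmic assigns targets internally. Instance names and the target list are made up.

```python
import hashlib


def owner(target: str, instances: list) -> str:
    """Rendezvous hashing: the instance with the highest hash score owns the target."""
    def score(instance: str) -> int:
        digest = hashlib.sha256(f"{instance}|{target}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(instances, key=score)


def assign(targets: list, instances: list) -> dict:
    """Map each instance to the list of targets it owns."""
    shares = {instance: [] for instance in instances}
    for target in targets:
        shares[owner(target, instances)].append(target)
    return shares


if __name__ == "__main__":
    targets = [f"10.0.0.{i}:57400" for i in range(1, 101)]
    for instance, owned in assign(targets, ["gnmic-a", "gnmic-b", "gnmic-c"]).items():
        print(instance, len(owned))
```

The property worth noticing is the failover behaviour: when one instance disappears, only its targets are reassigned, so the surviving instances keep their existing long-lived sessions.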
Verifying the Stream Is Actually Flowing
A subscription that silently stops is worse than SNMP, because nobody is polling to notice. Confirm three things on rollout.
Does the path even exist on this platform? Capabilities and a one-shot get answer that before you build a dashboard on a path the device does not export:
    gnmic -a 10.0.0.11:57400 -u admin -p secret --skip-verify capabilities
    gnmic -a 10.0.0.11:57400 -u admin -p secret --skip-verify \
      get --path "/interfaces/interface[name=Ethernet1]/state/oper-status"

Is gnmic itself healthy? It exposes its own operational metrics on the Prometheus endpoint. Scrape and alert on them:

    # subscriptions that have gone quiet: a target that stopped streaming
    rate(gnmic_subscribe_number_of_received_subscribe_response_messages_total[5m]) == 0

A target up in NetBox but flat-lined in gnmic_subscribe_* is a device that accepted the subscription and then stopped sending, usually a daemon restart on the box that did not re-establish. Alert on the absence of updates, not just on bad values, or the blind spot you eliminated comes back through the side door.
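The same "alert on absence" logic can be sketched outside PromQL. The Python below assumes you already have two inputs, both hypothetical here: an inventory of targets that should be streaming, and the timestamp of the last update seen from each. It flags targets that have gone quiet as well as targets that never sent anything.

```python
import time
from typing import Optional


def stale_targets(
    last_update: dict,
    expected: set,
    max_age: float,
    now: Optional[float] = None,
) -> list:
    """Return expected targets with no update in the last `max_age` seconds.

    A target missing from `last_update` has never streamed and counts as stale.
    """
    now = time.time() if now is None else now
    stale = []
    for target in sorted(expected):
        seen = last_update.get(target)
        if seen is None or now - seen > max_age:
            stale.append(target)
    return stale


if __name__ == "__main__":
    now = 1_000.0
    last_seen = {"10.0.0.11:57400": 995.0, "10.0.0.12:57400": 600.0}
    inventory = {"10.0.0.11:57400", "10.0.0.12:57400", "10.0.0.13:57400"}
    print(stale_targets(last_seen, inventory, max_age=300.0, now=now))
```

Driving the check from the inventory rather than from the metrics themselves is the point: a device that never connected produces no series to alert on.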
Enriching the Stream Before It Hits Prometheus
Raw OpenConfig paths make ugly metric names and carry only the labels the device sends — usually just the interface name and the gNMI target. gnmic’s processors rewrite the stream in flight, so the data lands in Prometheus already shaped for querying instead of needing relabeling rules on the scrape side.
    processors:
      add-site-label:
        event-add-tag:
          value-names: ["."]   # match any value -> always add the tag
          add:
            site: dc1
      drop-admin-down:
        event-allow:
          condition: '.values."oper-status" == "UP"'

    # processors attach under the OUTPUT, not the subscription
    outputs:
      prom:
        type: prometheus
        listen: ":9804"
        event-processors:
          - add-site-label
          - drop-admin-down

event-add-tag stamps a site label so PromQL can aggregate per data center without a join. event-allow/event-drop filter at the collector: drop counters from admin-down interfaces and you cut series count before it ever costs you Prometheus memory. Doing this in gnmic rather than in Prometheus relabeling keeps the cardinality fix close to the source, where you can see exactly which paths you are keeping.
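To show what those two processors do to the data, here is a small Python model of the pipeline. It is a simplification, not gnmic's implementation: events are plain dicts with tags and values, and the value key oper-status simply mirrors the condition in the config above. The real key names depend on what the device and gnmic emit, so treat them as assumptions.

```python
from typing import Callable, Iterable, Iterator


def add_tag(events: Iterable[dict], **tags: str) -> Iterator[dict]:
    """Model of event-add-tag: copy each event with extra tags merged in."""
    for event in events:
        yield {**event, "tags": {**event.get("tags", {}), **tags}}


def allow(events: Iterable[dict], condition: Callable[[dict], bool]) -> Iterator[dict]:
    """Model of event-allow: keep only events for which the condition holds."""
    return (event for event in events if condition(event))


events = [
    {"name": "if-state", "tags": {"interface_name": "Ethernet1"}, "values": {"oper-status": "UP"}},
    {"name": "if-state", "tags": {"interface_name": "Ethernet2"}, "values": {"oper-status": "DOWN"}},
]

# Processors run in the order they are listed: tag first, then filter.
shaped = list(
    allow(
        add_tag(events, site="dc1"),
        lambda event: event["values"].get("oper-status") == "UP",
    )
)
print(shaped)
```

Note that an allow-style filter keeps only what matches, so any event that lacks the tested value is dropped as well. That is worth checking against real events before relying on a condition like this one.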
A scaling note: cardinality is the cost that creeps. Every distinct combination of label values is its own time series, so a description label that operators edit freely will mint a new series on every edit. Stick to stable labels (interface name, site, target) and resist the urge to stream the human-readable description as a metric label.
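A back-of-the-envelope calculation makes the creep visible. The numbers below are purely illustrative, not measurements: a hypothetical fleet of 200 devices with 48 interfaces each and 20 counters per interface, and a description label that gets edited often enough to take three distinct values per interface within the retention window.

```python
def series_count(
    devices: int,
    interfaces_per_device: int,
    counters_per_interface: int,
    label_variants: int = 1,
) -> int:
    """Rough upper bound on time series: one per distinct label combination."""
    return devices * interfaces_per_device * counters_per_interface * label_variants


stable = series_count(200, 48, 20)
churning = series_count(200, 48, 20, label_variants=3)

print("stable labels only:", stable)            # 192000
print("with an edited description:", churning)  # 576000
```

Nothing about the network changed between the two figures. The only difference is one free-text label, and it tripled the series count.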
What I Keep SNMP For
Streaming does not retire SNMP overnight. Old gear that does not speak gNMI, and the occasional MIB with data no model exposes yet, keep SNMP alive at the edges. Run both: gNMI for everything that supports it, SNMP as the fallback for what does not. But for any device that speaks gNMI, the 60-second polling blind spot is a choice you no longer have to make — and the first time streaming catches a microburst that SNMP would have missed entirely, the switch pays for itself.