🚀 Executive Summary

TL;DR: Network teams often struggle with vendor lock-in and manual “ClickOps” for device configuration, which clashes with modern GitOps practices used elsewhere in the stack. To overcome this, organizations are building in-house network automation tools, evolving from quick scripts to robust internal platforms or abstraction layers that enable declarative, auditable, and scalable network management.

🎯 Key Takeaways

  • The core problem in network automation is the fundamental clash between manual “ClickOps” via vendor GUIs and the declarative, code-driven “GitOps” approach used for the rest of the IT stack.
  • In-house network automation solutions typically evolve through three stages: “Glorified Scripts” for immediate, specific fixes; “Internal Platforms” for comprehensive state management and idempotency; and “Abstraction Layers” that simplify existing IaC tools like Ansible or Terraform.
  • An “Internal Platform” for network automation consists of a Source of Truth (database for intended state), a Reconciler (worker comparing states and applying changes), and an API (interface for users/services to interact with the source of truth).

What in-house tools are you building or using for network automation?

Tired of vendor lock-in and clunky GUIs for network automation? We explore the real-world, in-house tools teams are building, from glorified scripts to full-blown internal platforms that finally treat network devices like cattle, not pets.

Build vs. Buy: What In-House Network Automation Tools Are We *Actually* Building?

I remember it like it was yesterday. 3 AM. The pager goes off because our brand-new service, `prod-app-west-3b`, can’t reach the primary database cluster, `prod-db-cluster-01`. The error is obvious: connection timeout. It’s a firewall rule. A simple, five-minute fix. Except that “fix” involved logging into a god-awful, slow-as-molasses vendor UI, clicking through seven sub-menus, and carefully typing an IP address into a tiny text box, praying I didn’t make a typo. That 3 AM rage-session was the moment we decided, “Never again. We’re building our own way.”

The Root of the Problem: ClickOps vs. GitOps

I was browsing a Reddit thread the other day asking this exact question, and it’s clear my 3 AM nightmare is a universal experience. The core issue is a fundamental clash of philosophies. The rest of our stack is managed declaratively via code—Terraform for cloud infra, Kubernetes manifests for services, Ansible for config management. It’s all in Git, it’s peer-reviewed, and it’s auditable. Then you hit the network layer, and it’s like stepping back in time. You’re suddenly forced into “ClickOps”—manually navigating a graphical interface designed by someone who has clearly never been on-call.

Vendors sell you a “single pane of glass,” but it often turns out to be a single point of failure and frustration. Their APIs tend to be afterthoughts: poorly documented, or just RPC wrappers for the same slow backend the UI uses. So, we do what any good engineer does: we fix the frustrating process with tools.

The Solutions: From Duct Tape to a Polished Platform

Looking at the discussion and my own team’s journey, the home-grown solutions generally fall into three categories. We’ve been through all three stages at TechResolve.

Solution 1: The “Glorified Scripts” Approach (The Quick Fix)

This is where everyone starts. You have an immediate, painful problem, like updating a firewall ACL or adding a VLAN to a switch port. You don’t need a whole platform; you just need to stop the bleeding.

This “tool” is often just a collection of Python or Go scripts living in a Git repo. The Python ones typically use libraries like netmiko or paramiko to SSH into devices and run commands. To make it usable by the wider team, you might wrap it in a simple Flask API or, more commonly, a Jenkins job with parameters.

Example: A basic Python script to open a port on a firewall.


import sys

from netmiko import ConnectHandler


def open_firewall_port(device_ip, username, password, source_ip, dest_port):
    """Connects to a device and adds a simple ACL rule."""
    cisco_fw = {
        # The ACL syntax below is ASA-style, so use netmiko's ASA driver.
        # Change this (and the command) if your device speaks a different dialect.
        'device_type': 'cisco_asa',
        'host': device_ip,
        'username': username,
        'password': password,
    }

    # send_config_set() enters and leaves config mode on its own, so the
    # explicit 'config t' / 'exit' commands are not needed.
    commands = [
        f'access-list outside_access_in extended permit tcp host {source_ip} any eq {dest_port}',
    ]

    try:
        # The context manager closes the SSH session even if a command fails.
        with ConnectHandler(**cisco_fw) as net_connect:
            print(f"Successfully connected to {device_ip}")
            output = net_connect.send_config_set(commands)
            print("Configuration sent:")
            print(output)
        print("Done.")

    except Exception as e:
        print(f"Failed to connect or configure {device_ip}: {e}")
        sys.exit(1)

# This would be triggered by a Jenkins job or CLI wrapper
# open_firewall_port('10.1.1.254', 'admin', 'supersecret', '203.0.113.50', '443')
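
For the "Jenkins job or CLI wrapper" part, here is a minimal sketch of what such a wrapper might look like. It assumes the script's arguments arrive as command-line flags and that the password comes from an environment variable; the flag names and `FW_PASSWORD` are placeholders I made up for illustration, not anything netmiko or Jenkins requires.

```python
import argparse
import ipaddress
import os


def port_number(value: str) -> int:
    """argparse type: accept only valid TCP port numbers."""
    port = int(value)  # a ValueError here becomes an argparse usage error
    if not 1 <= port <= 65535:
        raise argparse.ArgumentTypeError(f"port out of range: {value}")
    return port


def parse_request(argv: list[str]) -> dict:
    """Parse and validate CLI arguments into keyword arguments for the script."""
    parser = argparse.ArgumentParser(description="Open a TCP port on a firewall")
    # ipaddress.ip_address rejects malformed addresses before we touch a device.
    parser.add_argument("--device", required=True, type=ipaddress.ip_address)
    parser.add_argument("--source-ip", required=True, type=ipaddress.ip_address)
    parser.add_argument("--dest-port", required=True, type=port_number)
    parser.add_argument("--username", default="admin")
    args = parser.parse_args(argv)
    return {
        "device_ip": str(args.device),
        "username": args.username,
        # FW_PASSWORD is a hypothetical variable name; a Jenkins credentials
        # binding could populate it so the secret never lands in the repo.
        "password": os.environ.get("FW_PASSWORD", ""),
        "source_ip": str(args.source_ip),
        "dest_port": str(args.dest_port),
    }
```

The returned keys match the parameters of `open_firewall_port`, so the entry point could be as small as `open_firewall_port(**parse_request(sys.argv[1:]))`. A typo in an IP address then fails at the command line instead of on the firewall.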

Warning: This approach is fast but brittle. There’s no state management, no real error handling, and no idempotency. If you run the script twice, it might add the rule twice or just fail. It’s a sharp tool for a specific problem, and that’s it.
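
One cheap way to take the edge off the idempotency problem is to check what is already on the device before pushing anything. The helper below is a rough sketch of that idea, not a real fix: it assumes you have already fetched the current ACL configuration as text (for example with netmiko's `send_command`), and the function name is mine.

```python
def missing_lines(running_config: str, desired_lines: list[str]) -> list[str]:
    """Return the desired config lines that are not already on the device.

    This is a naive exact-match comparison. Devices often normalise what
    they store (for example showing a port by name instead of number), so
    treat this as a starting point rather than a reliable diff.
    """
    present = {line.strip() for line in running_config.splitlines()}
    return [line for line in desired_lines if line.strip() not in present]


if __name__ == "__main__":
    current = """
    access-list outside_access_in extended permit tcp host 203.0.113.50 any eq 443
    """
    wanted = [
        "access-list outside_access_in extended permit tcp host 203.0.113.50 any eq 443",
        "access-list outside_access_in extended permit tcp host 203.0.113.51 any eq 443",
    ]
    # Only the second rule is new, so only that one would be sent.
    print(missing_lines(current, wanted))
```

If the returned list is empty, the script can skip the config push entirely, which makes a second run a no-op instead of a gamble.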

Solution 2: The “Internal Platform” (The Permanent Fix)

After the glorified scripts start to multiply and become unmanageable, you evolve. This is where you build a proper, albeit minimal, internal platform. This isn’t about just running commands anymore; it’s about managing state.

At TechResolve, we call ours “NetBridge.” At its core, it has three components:

  • A Source of Truth: A simple database (we use PostgreSQL) that stores the intended state of your network. For example, a table called `firewall_rules` contains every rule that *should* exist.
  • A Reconciler: A worker process (we use Celery with Redis) that constantly compares the intended state in the database with the actual state on the devices. If there’s a drift, it generates and applies the necessary changes to fix it.
  • An API: A simple API (we use FastAPI) that allows users and services to interact with the source of truth, not the devices directly. A user POSTs a new rule to the API, which just adds a row to the database. The reconciler handles the rest.
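
The heart of the reconciler is a diff between two sets of rules. The sketch below shows only that comparison step, under the assumption that both the database rows and the device state have already been parsed into the same shape. `FirewallRule` and `plan_changes` are illustrative names, not NetBridge's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class FirewallRule:
    """One intended or observed rule; frozen so it can live in a set."""
    source_ip: str
    dest_port: int
    protocol: str = "tcp"


def plan_changes(
    intended: set[FirewallRule], actual: set[FirewallRule]
) -> tuple[list[FirewallRule], list[FirewallRule]]:
    """Compare intended state with device state.

    Returns (rules_to_add, rules_to_remove). Both lists are sorted so the
    plan is deterministic and easy to log or review.
    """
    to_add = sorted(intended - actual)
    to_remove = sorted(actual - intended)
    return to_add, to_remove


if __name__ == "__main__":
    intended = {FirewallRule("203.0.113.50", 443), FirewallRule("203.0.113.51", 5432)}
    actual = {FirewallRule("203.0.113.51", 5432), FirewallRule("198.51.100.7", 22)}
    add, remove = plan_changes(intended, actual)
    print("add:", add)
    print("remove:", remove)
```

When the two sets match, both lists come back empty and the worker does nothing, which is what makes repeated runs safe. The hard parts in practice are everything around this function: reading device state reliably and applying the plan without locking yourself out.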

This approach turns your network configuration into a real data model. Because every change flows through the API and lands in the database, an audit log is cheap to add, you can build a simple UI on top of the API, and you can enforce policies and validation at the API layer before a change is even attempted.
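
To make "validation at the API layer" concrete, here is a small sketch of the kind of check that could run before a row is ever written. It uses only the standard library so it stays framework-neutral; in a FastAPI service the same logic would sit inside the request handler or a model validator. The allowed-port policy and field names are invented for the example.

```python
import ipaddress

# Hypothetical policy: only these destination ports may be opened via the API.
ALLOWED_PORTS = {22, 443, 5432}


def validate_rule_request(payload: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []

    source = payload.get("source_ip")
    if not isinstance(source, str):
        errors.append("source_ip must be a string")
    else:
        try:
            ipaddress.ip_address(source)
        except ValueError:
            errors.append(f"source_ip is not a valid IP address: {source!r}")

    port = payload.get("dest_port")
    # bool is a subclass of int, so exclude it explicitly.
    if not isinstance(port, int) or isinstance(port, bool):
        errors.append("dest_port must be an integer")
    elif port not in ALLOWED_PORTS:
        errors.append(f"dest_port {port} is not permitted by policy")

    return errors
```

A handler would reject the request with these messages if the list is non-empty, and only insert into the source of truth otherwise. The bad change never reaches the reconciler, let alone a device.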

Solution 3: The “Abstraction Layer” (The Pragmatic Middle Ground)

Sometimes, building a full-blown state-management platform is overkill. You already have great tools like Ansible and Terraform, but they are too complex for the average developer who just wants to open a port for their new microservice.

The “Abstraction Layer” approach focuses on building a simplified, opinionated interface on top of these powerful tools. You’re not writing low-level device interaction code. Instead, you’re creating a “golden path” for common operations.

This could look like:

  • A “Service Catalog” of Ansible Roles: You write a set of highly reliable, idempotent Ansible roles (e.g., `create-vlan`, `update-acl`, `configure-bgp-peering`). Then you expose them through AWX/Tower or a Jenkins pipeline with a very simple, constrained set of parameters.
  • A Custom Terraform Provider: For more advanced teams, you can write your own internal Terraform provider that wraps your company’s specific logic. This allows teams to define network resources in their own HCL files, but the provider handles the translation to the vendor-specific API calls or SSH commands.
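
The "constrained set of parameters" idea can be sketched in a few lines. The example below assumes one playbook per operation under a `playbooks/` directory and an `inventory.ini` file; both paths, the parameter names, and the `GOLDEN_PATHS` table are hypothetical. It only builds the `ansible-playbook` command line and leaves running it to whatever pipeline you already have.

```python
import json

# Hypothetical catalog: each operation and the exact parameters it accepts.
GOLDEN_PATHS = {
    "create-vlan": {"vlan_id", "vlan_name"},
    "update-acl": {"source_ip", "dest_port"},
}


def build_ansible_command(
    operation: str, params: dict, inventory: str = "inventory.ini"
) -> list[str]:
    """Validate a request against the catalog and return the argv to run."""
    if operation not in GOLDEN_PATHS:
        raise ValueError(f"unsupported operation: {operation}")

    allowed = GOLDEN_PATHS[operation]
    unexpected = set(params) - allowed
    missing = allowed - set(params)
    if unexpected or missing:
        raise ValueError(
            f"bad parameters for {operation}: "
            f"unexpected={sorted(unexpected)}, missing={sorted(missing)}"
        )

    return [
        "ansible-playbook",
        "-i", inventory,
        f"playbooks/{operation}.yml",
        # --extra-vars accepts a JSON string; sort_keys keeps it stable for logs.
        "--extra-vars", json.dumps(params, sort_keys=True),
    ]
```

Returning an argument list rather than a shell string means the result can be passed straight to `subprocess.run` without quoting problems. Developers only ever see the operation name and its handful of parameters; the playbooks stay the network team's concern.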

This is my favorite approach for most teams because it leverages the battle-tested power of existing IaC tools while providing a simple, safe interface for the rest of the organization.

Comparison at a Glance

| Approach | Pros | Cons | Best For |
| --- | --- | --- | --- |
| 1. Glorified Scripts | Extremely fast to implement; solves immediate pain points | Brittle, not idempotent; no state management; doesn’t scale | Emergency fixes or very small teams with one-off tasks. |
| 2. Internal Platform | Central source of truth; idempotent by design; highly scalable and auditable | Significant engineering investment; can be complex to build/maintain | Larger organizations with a dedicated platform or SRE team. |
| 3. Abstraction Layer | Leverages existing, robust tools; promotes GitOps practices; safer for wider org use | Constrains users to “golden paths”; still reliant on underlying tools (Ansible/TF) | Most mid-to-large size teams who want to empower developers safely. |

Ultimately, there’s no one-size-fits-all answer. We started with scripts out of necessity, evolved to an abstraction layer, and are now maturing parts of that into a true internal platform. The key is to start small, solve a real problem, and let your tooling evolve with your team’s needs. Just don’t let yourself get stuck in the 3 AM ClickOps nightmare.

Darian Vance

Lead Cloud Architect & DevOps Strategist

With over 12 years in system architecture and automation, Darian specializes in simplifying complex cloud infrastructures. An advocate for open-source solutions, he founded TechResolve to provide engineers with actionable, battle-tested troubleshooting guides and robust software alternatives.


🤖 Frequently Asked Questions

❓ What are the main approaches to building in-house network automation tools?

The article outlines three main approaches: “Glorified Scripts” for quick fixes, “Internal Platforms” for comprehensive state management, and “Abstraction Layers” that simplify existing IaC tools like Ansible or Terraform.

❓ How do in-house network automation solutions compare to vendor-provided tools?

In-house tools aim to overcome vendor lock-in, clunky GUIs, and poorly documented APIs often found in vendor solutions, allowing for custom, declarative, and GitOps-aligned network management, though they require significant engineering investment.

❓ What is a common pitfall when starting with in-house network automation using scripts?

A common pitfall with “Glorified Scripts” is their brittleness, lack of state management, and non-idempotency. The solution is to evolve towards an “Internal Platform” or “Abstraction Layer” that incorporates state tracking, error handling, and idempotent operations.
