🚀 Executive Summary
TL;DR: Many organizations face ‘automation debt’ due to chaotic, unversioned scripts, hindering scalability and auditability. This guide outlines a three-stage DevOps automation journey, from ad-hoc scripting to a streamlined GitOps workflow, to help engineers escape ‘script hell’ and build robust, auditable systems.
🎯 Key Takeaways
- The three stages of DevOps automation are: Wild West (ad-hoc scripting), Framework (centralized & repeatable with CI/CD and config management), and Operating System (declarative GitOps with IaC and GitOps agents).
- Automation debt arises from prioritizing urgent, short-term fixes (like quick bash scripts) over important, long-term solutions, leading to unversioned and untrustworthy scripts.
- Transitioning effectively requires sequential progression: first, version control all scripts and introduce CI/CD (Stage 2), then gradually adopt Infrastructure as Code and GitOps agents for declarative state management (Stage 3).
A senior engineer’s guide to navigating the three stages of DevOps automation, from chaotic scripts to a streamlined GitOps workflow, and how to escape the “good enough” trap.
The Three Stages of DevOps Automation (And How to Escape Script Hell)
I remember my first week at a previous gig. My manager, a guy who lived on coffee and stress, told me to deploy the latest marketing service. “The script is on the ops share,” he said, waving vaguely towards the server room. I navigated to \\ops-shared\critical_scripts and my blood ran cold. It was a digital graveyard of good intentions: deploy.sh, deploy_new.sh, deploy_v2_final.sh, and my personal favorite, DO_NOT_USE_deploy_old.sh. That day, I learned the difference between having scripts and having automation. One is a collection of tools; the other is a system. We’ve all been there, and if you’re there now, let’s talk about how to get out.
The “Why”: The Automation Debt Spiral
Why do we end up with folders full of undeclared, unversioned, and untrustworthy scripts? It’s not because we’re bad engineers. It’s because we’re firefighters. A production server goes down at 3 AM. You write a quick bash script to check a process and restart it. It works. The fire is out. You save it as fix_prod_db_01.sh and move on to the next blaze. You’ve just taken on “automation debt.” It solved an immediate, urgent problem, but it wasn’t an important, long-term solution. Multiply that by a dozen engineers over three years, and you get the critical_scripts folder. The root cause is prioritizing the urgent over the important.
Stage 1: The Wild West (Ad-Hoc Scripting)
This is where everyone starts. It’s characterized by individual scripts, written to solve a specific problem, often living on an engineer’s laptop or a shared drive. This is the “just get it done” phase.
What it looks like:
- A collection of `.sh` or `.ps1` files.
- Execution is manual. You SSH into `prod-worker-03` and run `./run_cleanup.sh`.
- There is no source control. The “latest version” is the one you edited last.
- Knowledge is tribal. Only Janet knows why `update_config_special.sh` needs to be run with the `--force-legacy` flag.
Here’s a classic example of a “Stage 1” deployment script. It’s hacky, but we’ve all written one.
```bash
#!/bin/bash
# DEPLOY SCRIPT - DO NOT EDIT WITHOUT TALKING TO DAVE
echo "Connecting to production server..."
ssh user@prod-api-01 << 'ENDSSH'
echo "Stopping service..."
sudo systemctl stop my-api-service
echo "Pulling latest code from main..."
cd /var/www/my-api
git pull origin main
echo "Installing dependencies..."
npm install --production
echo "Restarting service..."
sudo systemctl start my-api-service
echo "Deployment complete on prod-api-01!"
ENDSSH
```
Warning: This stage is a ticking time bomb. It’s not scalable, it’s not auditable, and it’s one “fat finger” away from a production outage. The goal is to escape this stage as quickly as possible.
Stage 2: The Framework (Centralized & Repeatable)
This is the most critical leap in your automation journey. You stop thinking in terms of individual scripts and start thinking in terms of repeatable processes. The goal here is consistency and control, not perfection.
How to get there:
- Version Control Everything: The first step is non-negotiable. Create a Git repository (e.g., `ops-automation`) and commit every single script. Now you have history, accountability, and the ability to review changes.
- Introduce a CI/CD Tool: Use Jenkins, GitLab CI, or GitHub Actions as your single point of execution. No more manual SSH sessions. The CI tool pulls the repo and runs the script. This gives you an audit log and control over who can run what.
- Adopt a Configuration Management Tool: Instead of raw bash scripts, start using a tool like Ansible or Puppet. This forces you to think declaratively (“ensure this package is installed”) instead of imperatively (“run apt-get install”).
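That first step is a ten-minute task, not a project. Here is a minimal sketch of the initial commit, assuming Git is installed; the repository path, script name, and commit identity are all illustrative:

```shell
set -e
# Hypothetical first step out of Stage 1: put one existing script under version control.
rm -rf /tmp/ops-automation
mkdir -p /tmp/ops-automation && cd /tmp/ops-automation
git init -q
git config user.name "ops"                 # local identity so the commit works anywhere
git config user.email "ops@example.com"
printf '#!/bin/bash\necho "deploying..."\n' > deploy.sh   # stand-in for a real script
chmod +x deploy.sh
git add deploy.sh
git commit -q -m "Import first script from the ops share"
git log --oneline                          # history and accountability start here
```

From here, every edit to the script is a diff someone can review, not a silent overwrite on a shared drive.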
Your “deployment” now becomes a CI/CD job that runs an Ansible playbook. The playbook itself is version-controlled.
```yaml
# ansible/playbooks/deploy-api.yml
---
- hosts: api_servers
  become: yes
  tasks:
    - name: Pull latest code from Git
      git:
        repo: 'git@github.com:your-org/my-api.git'
        dest: /var/www/my-api
        version: main
      notify: Restart API Service

    - name: Install NPM dependencies
      npm:
        path: /var/www/my-api
        state: present
        production: yes

  handlers:
    - name: Restart API Service
      service:
        name: my-api-service
        state: restarted
```
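The playbook still needs a controlled trigger. One possible sketch, assuming GitHub Actions with a self-hosted runner that has Ansible and SSH access to the hosts; the workflow name and inventory path are hypothetical:

```yaml
# .github/workflows/deploy-api.yml (hypothetical)
name: Deploy API
on:
  workflow_dispatch:        # deploys are deliberate, manually triggered runs
jobs:
  deploy:
    runs-on: self-hosted    # runner can reach the api_servers inventory group
    steps:
      - uses: actions/checkout@v4
      - name: Run the deployment playbook
        run: ansible-playbook -i inventory/production ansible/playbooks/deploy-api.yml
```

Every run now leaves a job log with who triggered it, when, and what happened, which is exactly the audit trail Stage 1 lacks.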
Stage 3: The Operating System (Declarative GitOps)
This is the promised land. In this stage, your Git repository isn’t just a place to store scripts; it is the desired state of your entire infrastructure. You don’t “run” anything. You declare the state you want in Git, and an automated system makes it a reality. This is how you retire manual operations for good.
What it looks like:
- Infrastructure as Code (IaC): Tools like Terraform or Pulumi define your servers, load balancers, and databases in code. A change to a server type is a pull request, not a panicked click in a cloud console.
- GitOps Agents: Tools like ArgoCD (for Kubernetes) or Flux constantly compare the live state of your environment to the desired state in your Git repo. If there’s a drift, they automatically correct it.
- The Human Role: Engineers don’t push buttons anymore. They write code, create pull requests, and get them reviewed. Merging to the main branch is the deployment.
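For a Kubernetes workload, the “GitOps agent” piece might look like the following ArgoCD Application; the repository URL, manifest path, and namespaces are illustrative:

```yaml
# argocd/apps/my-api.yaml (hypothetical)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/ops-automation.git
    targetRevision: main
    path: k8s/my-api            # manifests describing the desired state
  destination:
    server: https://kubernetes.default.svc
    namespace: my-api
  syncPolicy:
    automated:
      prune: true               # delete resources removed from Git
      selfHeal: true            # revert manual drift in the cluster
```

With `selfHeal` enabled, even a well-intentioned `kubectl edit` gets reverted: Git, not the cluster, is the source of truth.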
```hcl
# terraform/modules/s3/main.tf
# Defines a private S3 bucket for application logs
resource "aws_s3_bucket" "app_logs" {
  bucket = "techresolve-prod-app-logs"

  tags = {
    Name        = "Prod App Logs"
    Environment = "Production"
    ManagedBy   = "Terraform"
  }
}

resource "aws_s3_bucket_public_access_block" "app_logs_access" {
  bucket = aws_s3_bucket.app_logs.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```
Pro Tip: Don’t try to jump from Stage 1 to Stage 3 overnight. You’ll burn out. The path is sequential. Get your scripts into Git (Stage 2). Then, start carving off one piece of your infrastructure—like S3 buckets or security groups—and manage it with Terraform (the beginning of Stage 3).
Your Journey at a Glance
It’s easy to get lost, so here’s a quick comparison of the stages.
| Attribute | Stage 1: Wild West | Stage 2: Framework | Stage 3: GitOps |
|---|---|---|---|
| Trigger | Manual (SSH & Run) | CI/CD Job (e.g., Jenkins) | Git Commit / Merge |
| Source of Truth | Engineer’s memory | Git Repo for Scripts/Playbooks | Git Repo for Desired State |
| Audit Trail | None / Bash History | CI/CD Job Logs | Git History (Immutable) |
| Scalability | Very Low | Medium | Very High |
Wherever you are on this journey, the key is to recognize it and take the next small, concrete step forward. If your scripts are a mess, your first step isn’t to learn Terraform. It’s to `git init` a new repository and commit that first script. You’re not alone in this, and every single senior engineer has that deploy_v2_final_REAL.sh file somewhere in their past.
🤖 Frequently Asked Questions
❓ What are the three stages of DevOps automation?
The three stages are: Stage 1 (Wild West) characterized by ad-hoc, unversioned scripts; Stage 2 (Framework) involving centralized, repeatable processes with version control, CI/CD tools, and configuration management; and Stage 3 (Operating System) which is declarative GitOps using Infrastructure as Code and GitOps agents.
❓ How does a GitOps approach (Stage 3) differ from traditional CI/CD (Stage 2)?
In Stage 2, CI/CD tools execute version-controlled scripts or playbooks, acting as the trigger for deployments. In Stage 3 (GitOps), the Git repository itself is the desired state of the infrastructure, and GitOps agents automatically reconcile the live environment with this declared state, making a Git commit or merge the immutable deployment trigger.
❓ What is a common pitfall when trying to advance in the automation journey?
A common pitfall is attempting to jump directly from Stage 1 (Wild West) to Stage 3 (GitOps). The recommended approach is sequential: first, get scripts into version control and implement CI/CD (Stage 2), then gradually introduce IaC and GitOps for specific infrastructure components.