🚀 Executive Summary
TL;DR: A junior engineer nearly caused a production disaster due to inadequate guardrails and overly permissive access. The solution involves implementing automated CI/CD pipelines with mandatory peer review, policy-as-code, and potentially ‘break glass’ access to prevent catastrophic infrastructure changes and ensure safe, predictable deployments.
🎯 Key Takeaways
- Root causes of production incidents often stem from overly permissive IAM roles, lack of mandatory peer review, poor environment separation, and a ‘cowboy’ culture.
- Automating guardrails via CI/CD pipelines with GitOps for Terraform, including `terraform plan` in Pull Requests and requiring code owner approval, significantly reduces the risk of human error.
- Policy-as-code tools like Open Policy Agent (OPA) or Sentinel can be integrated into pipelines to automatically enforce infrastructure policies and prevent non-compliant or oversized resource provisioning.
A junior engineer running a simple command that nearly vaporizes production is a rite of passage in DevOps. Here’s how to build guardrails to prevent costly mistakes without stifling your team’s ability to learn and contribute.
That Time a Junior Almost Cost Us a Fortune: Or, “Will Skills Cost Credits Too?”
I still remember the feeling. A Slack notification pops up from our new hire, Alex. It’s a screenshot of a Terraform plan, and my stomach just drops. The last line is glowing red: `Plan: 10 to add, 8 to change, 212 to destroy.` Two hundred and twelve resources… to destroy. In production. He was just trying to spin up a new staging environment by copying some of our prod modules, but a tiny mistake with his workspace selection targeted our entire EU customer-facing infrastructure. The only thing that saved us that day was a clunky, manual approval step I’d insisted on keeping.
That incident, that near-miss, is exactly what a recent Reddit thread titled “Will skills cost credits too?” brought to mind. The fear isn’t that our people lack skill; it’s that the tools we give them have a razor-thin margin between “routine task” and “catastrophic failure.”
Why Does This Keep Happening? The Root Cause Isn’t the Junior.
It’s tempting to blame the “new guy” who ran the command. Don’t. If a single command from a junior engineer can nuke your production environment, the problem isn’t the engineer; it’s your process. You’ve essentially handed someone the keys to a Ferrari without explaining where the brakes are or, better yet, installing an automatic braking system.
The root cause is almost always one of these:
- Overly Permissive IAM Roles: Giving developers `AdministratorAccess` “just to get them unblocked” is a ticking time bomb.
- Lack of Guardrails: No mandatory peer review process for infrastructure changes.
- Environment Bleed: Poor separation between development, staging, and production environments, making it easy to target the wrong one.
- A “Cowboy” Culture: A team culture where engineers apply changes directly from their laptops without a standardized pipeline.
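The first item on that list is the easiest to check mechanically. Here is a minimal sketch, assuming you can export your IAM policies as JSON documents, that flags any `Allow` statement granting every action on every resource. The function names and the `read_only_policy` example are hypothetical; only the policy document shape (`Statement`, `Effect`, `Action`, `Resource`) comes from IAM itself.

```python
import json
from typing import Any


def _as_list(value: Any) -> list:
    """IAM accepts a single value or a list in these fields; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def find_admin_statements(policy: dict) -> list[dict]:
    """Return the Allow statements that grant every action on every resource."""
    risky = []
    for statement in _as_list(policy.get("Statement")):
        if statement.get("Effect") != "Allow":
            continue
        actions = _as_list(statement.get("Action"))
        resources = _as_list(statement.get("Resource"))
        if "*" in actions and "*" in resources:
            risky.append(statement)
    return risky


# Same shape as the AWS-managed AdministratorAccess policy.
admin_policy = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [{"Effect": "Allow", "Action": "*", "Resource": "*"}]
}
""")

# A narrower, hypothetical policy for comparison.
read_only_policy = {
    "Version": "2012-10-17",
    "Statement": {
        "Effect": "Allow",
        "Action": ["s3:GetObject"],
        "Resource": "arn:aws:s3:::example-bucket/*",
    },
}

print(len(find_admin_statements(admin_policy)))      # 1
print(len(find_admin_statements(read_only_policy)))  # 0
```

This only catches the bluntest case. Service-wide wildcards such as `iam:*` deserve their own rules, and a real audit would lean on your cloud provider's access analysis tooling rather than a script like this.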
Fixing the Problem: From Band-Aids to Body Armor
You can’t just tell people to “be more careful.” You have to build a system that makes being careful the path of least resistance. Here are three approaches, from the immediate fix to the long-term architectural solution.
Solution 1: The Quick Fix – The “Buddy System”
This is the immediate, low-tech solution you can implement this afternoon. For any environment deemed critical (like production or shared staging), you enforce a strict “two-person rule” for any state-changing operations.
No one, not even me, can run a `terraform apply` or `kubectl apply -f .` against `prod-db-01` without getting a second pair of eyes on the plan first. It’s manual, it adds friction, but it stops the bleeding.
Your workflow might look like this:
- Engineer runs `terraform plan -out=prod.plan`.
- They post the plan output to a dedicated Slack channel like `#infra-prod-review`.
- A second engineer must review the plan and give an “LGTM” (Looks Good To Me) with a thumbs-up emoji.
- Only then can the original engineer run `terraform apply "prod.plan"`.
Warning: This is a stop-gap measure. It relies on human discipline, which can fail under pressure. It’s a “hacky” but effective way to immediately reduce risk while you build a better system.
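A reviewer skimming hundreds of lines of plan output in Slack can still miss the one line that matters. As a rough sketch of how to make the review step harder to get wrong, the script below reads the JSON form of a saved plan (produced with `terraform show -json prod.plan > plan.json`) and builds a short message that lists every resource slated for destruction. The function names and the `sample_plan` data are made up for illustration; the `resource_changes`, `address`, and `change.actions` fields are part of Terraform's JSON plan representation.

```python
def summarize_plan(plan: dict) -> dict:
    """Count resources to add, change, and destroy in a JSON plan."""
    counts = {"add": 0, "change": 0, "destroy": 0}
    for resource in plan.get("resource_changes", []):
        actions = resource["change"]["actions"]
        # A replacement carries both "delete" and "create", so it counts twice.
        if "create" in actions:
            counts["add"] += 1
        if "update" in actions:
            counts["change"] += 1
        if "delete" in actions:
            counts["destroy"] += 1
    return counts


def destroyed_addresses(plan: dict) -> list[str]:
    """List the addresses of resources the plan would delete or replace."""
    return [
        resource["address"]
        for resource in plan.get("resource_changes", [])
        if "delete" in resource["change"]["actions"]
    ]


def review_message(plan: dict) -> str:
    """Build a short summary suitable for pasting into a review channel."""
    counts = summarize_plan(plan)
    lines = [
        f"Plan: {counts['add']} to add, {counts['change']} to change, "
        f"{counts['destroy']} to destroy."
    ]
    lines += [f"  DESTROY {address}" for address in destroyed_addresses(plan)]
    return "\n".join(lines)


# Hypothetical plan data; in practice, load plan.json with json.load().
sample_plan = {
    "resource_changes": [
        {"address": "aws_instance.web", "change": {"actions": ["create"]}},
        {"address": "aws_security_group.web", "change": {"actions": ["update"]}},
        {"address": "aws_db_instance.main", "change": {"actions": ["delete", "create"]}},
        {"address": "aws_s3_bucket.logs", "change": {"actions": ["delete"]}},
        {"address": "aws_vpc.main", "change": {"actions": ["no-op"]}},
    ]
}

print(review_message(sample_plan))
# Plan: 2 to add, 1 to change, 2 to destroy.
#   DESTROY aws_db_instance.main
#   DESTROY aws_s3_bucket.logs
```

Putting the destroy list at the top of the message means the second engineer sees the dangerous part first, which is the whole point of the buddy system.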
Solution 2: The Permanent Fix – Automate the Guardrails with a Pipeline
This is where we put on our architect hats. The goal is to make it impossible for an individual to apply changes from their machine. All changes must go through a CI/CD pipeline, preferably triggered by a Pull Request.
A good GitOps workflow for Terraform looks like this:
- An engineer creates a feature branch (e.g., `feature/add-redis-cache`).
- They make their code changes and open a Pull Request against the `main` branch.
- The CI/CD pipeline (GitHub Actions, GitLab CI, Jenkins) automatically triggers, running `terraform plan`.
- The plan’s output is posted as a comment directly on the Pull Request for everyone to see.
- The PR requires at least one approval from a designated code owner.
- Once approved and merged, the pipeline automatically runs `terraform apply` against the target environment.
Here’s a simplified example of a GitHub Actions step to achieve this:
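    # Assumes an earlier hashicorp/setup-terraform step, whose wrapper exposes
    # the command's stdout as steps.plan.outputs.stdout.
    - name: Terraform Plan
      id: plan
      if: github.event_name == 'pull_request'
      run: |
        terraform plan -no-color -input=false
      # Keep the job running so the plan (or its error) still gets posted.
      continue-on-error: true

    - name: Post Plan to PR
      uses: actions/github-script@v6
      if: github.event_name == 'pull_request'
      env:
        PLAN: "${{ steps.plan.outputs.stdout }}"
      with:
        github-token: ${{ secrets.GITHUB_TOKEN }}
        script: |
          // The fence goes on its own line so the plan renders as a code block.
          const output = `#### Terraform Plan 📖\n\n\`\`\`\n${process.env.PLAN}\n\`\`\``;
          await github.rest.issues.createComment({
            issue_number: context.issue.number,
            owner: context.repo.owner,
            repo: context.repo.repo,
            body: output
          });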
Pro Tip: Supercharge this by adding policy-as-code tools like Open Policy Agent (OPA) or Sentinel. These can automatically fail a build if someone tries to, for example, create a publicly open S3 bucket or provision an oversized `r6g.16xlarge` database for a test environment.
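To make the idea concrete, here is a rough stand-in for such a policy check. It is plain Python, not Rego or Sentinel, and it only illustrates the kind of rule those tools would enforce against the JSON form of a plan. The rule set, the `OVERSIZED_FOR_NONPROD` blocklist, and the sample data are all hypothetical, and the sketch assumes the AWS provider exposes `acl` and `instance_class` under each resource's `change.after` block.

```python
PUBLIC_ACLS = {"public-read", "public-read-write"}
# Hypothetical blocklist: instance classes we never want outside production.
OVERSIZED_FOR_NONPROD = {"db.r6g.16xlarge"}


def evaluate(plan: dict, environment: str) -> list[str]:
    """Return one human-readable violation per rule a planned resource breaks."""
    violations = []
    for resource in plan.get("resource_changes", []):
        # "after" is null for pure deletions, so fall back to an empty dict.
        after = resource["change"].get("after") or {}
        rtype = resource.get("type")
        address = resource.get("address", "<unknown>")

        if rtype in ("aws_s3_bucket", "aws_s3_bucket_acl") and after.get("acl") in PUBLIC_ACLS:
            violations.append(f"{address}: public ACL '{after['acl']}' is not allowed")

        if (
            rtype == "aws_db_instance"
            and environment != "production"
            and after.get("instance_class") in OVERSIZED_FOR_NONPROD
        ):
            violations.append(
                f"{address}: {after['instance_class']} is too large for {environment}"
            )
    return violations


# Hypothetical plan data, shaped like `terraform show -json` output.
sample_plan = {
    "resource_changes": [
        {"address": "aws_s3_bucket.assets", "type": "aws_s3_bucket",
         "change": {"actions": ["create"], "after": {"acl": "public-read"}}},
        {"address": "aws_db_instance.test", "type": "aws_db_instance",
         "change": {"actions": ["create"], "after": {"instance_class": "db.r6g.16xlarge"}}},
        {"address": "aws_s3_bucket.old", "type": "aws_s3_bucket",
         "change": {"actions": ["delete"], "after": None}},
    ]
}

for violation in evaluate(sample_plan, environment="test"):
    print(violation)
# aws_s3_bucket.assets: public ACL 'public-read' is not allowed
# aws_db_instance.test: db.r6g.16xlarge is too large for test
```

In a pipeline, a non-empty violation list would fail the build before `terraform apply` ever runs. A dedicated policy engine earns its keep once the rules outgrow a single script and need their own review, testing, and versioning.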
Solution 3: The ‘Nuclear’ Option – “Break Glass” Access Only
In highly regulated or extremely sensitive environments, you take the previous solution one step further: you remove all human ability to make direct changes. Engineers do not have AWS credentials on their laptops that can modify production. Period.
All infrastructure modifications, without exception, MUST go through the audited CI/CD pipeline. What about emergencies? That’s where you implement a “break glass” procedure.
This is a formal, audited process for gaining temporary, elevated privileges. It might involve a PagerDuty escalation, approval from a director, and a tool like AWS IAM Identity Center (formerly SSO) to grant a specific role for a limited time (e.g., 60 minutes). Every action taken during this window is heavily logged and reviewed afterward.
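The moving parts of that procedure (a second approver, a hard expiry, a log of every action) are simple enough to model. The sketch below is a toy, not a real access system: the `BreakGlassGrant` class and its fields are invented for illustration, and in practice your identity provider would enforce the time limit for you.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class BreakGlassGrant:
    """A hypothetical record of one temporary, approved privilege elevation."""
    engineer: str
    approver: str
    role: str
    reason: str
    issued_at: datetime
    duration: timedelta = timedelta(minutes=60)
    audit_log: list = field(default_factory=list)

    def __post_init__(self) -> None:
        # Nobody gets to approve their own emergency access.
        if self.engineer == self.approver:
            raise ValueError("approver must be a different person")

    @property
    def expires_at(self) -> datetime:
        return self.issued_at + self.duration

    def is_active(self, now: datetime) -> bool:
        return self.issued_at <= now < self.expires_at

    def record(self, action: str, now: datetime) -> None:
        """Log an action, refusing anything attempted outside the window."""
        if not self.is_active(now):
            raise PermissionError(f"grant for {self.engineer} is not active")
        self.audit_log.append((now.isoformat(), self.engineer, action))


issued = datetime(2024, 1, 9, 14, 0, tzinfo=timezone.utc)
grant = BreakGlassGrant(
    engineer="alex",
    approver="director-on-call",
    role="prod-emergency-admin",
    reason="hypothetical incident",
    issued_at=issued,
)

grant.record("restart database failover", now=issued + timedelta(minutes=10))
print(grant.is_active(issued + timedelta(minutes=59)))  # True
print(grant.is_active(issued + timedelta(minutes=60)))  # False
print(len(grant.audit_log))                             # 1
```

The audit log is what the post-incident review reads, so every privileged action should pass through something like `record` instead of being reconstructed from memory afterward.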
Here’s a comparison of these access models:
| Approach | Pros | Cons |
|---|---|---|
| Direct Access | Fast, flexible, easy to get started. | Extremely risky, no audit trail, prone to error. |
| Pipeline + PRs | Auditable, peer-reviewed, significantly safer. | Slower than direct access, requires pipeline setup. |
| Break Glass Only | Maximum security and compliance, fully audited. | High friction, can slow down incident response if not well-rehearsed. |
Ultimately, the skills our teams possess are our greatest asset. But a skill without a safe environment to practice and apply it is a liability. It’s our job as senior engineers and architects not to blame them when they make a mistake, but to build the systems that turn potential “credit-costing” disasters into safe, boring, and predictable Tuesday deployments.
🤖 Frequently Asked Questions
❓ How can organizations prevent junior engineers from accidentally destroying production environments?
Organizations must implement robust guardrails such as automated CI/CD pipelines, mandatory peer review processes for infrastructure changes, strict IAM role permissions, and clear environment separation to prevent catastrophic failures.
❓ What are the different approaches to managing infrastructure access and changes?
Approaches include ‘Direct Access’ (fast, flexible, but extremely risky), ‘Pipeline + PRs’ (auditable, peer-reviewed, significantly safer), and the ‘Break Glass’ option (maximum security, fully audited, but high friction for incident response).
❓ What is a common implementation pitfall when managing infrastructure changes?
A common pitfall is overly permissive IAM roles, such as giving developers `AdministratorAccess`. This can be solved by implementing least privilege principles and enforcing all infrastructure modifications through audited CI/CD pipelines rather than direct access from individual machines.