🚀 Executive Summary
TL;DR: Manually fixing production issues outside of CI/CD pipelines creates ‘Configuration Drift’ and ‘Snowflake Servers,’ leading to unpredictable infrastructure. The solution involves maintaining a single ‘Source of Truth’ through CI/CD, adopting Immutable Infrastructure, and using Infrastructure as Code (IaC) for predictable, auditable deployments and rapid recovery.
🎯 Key Takeaways
- Bypassing CI/CD for manual production fixes leads to ‘Configuration Drift,’ breaking the ‘Source of Truth’ and causing unpredictable infrastructure behavior.
- A ‘Pipeline Override’ allows for emergency fixes by using flags (e.g., `[skip-tests]`) within the CI/CD process, ensuring changes are still logged and merged.
- Adopting ‘Immutable Infrastructure’ prevents in-place server fixes; instead, faulty instances are replaced with new ones built from Infrastructure as Code (IaC) scripts.
- The ‘Nuclear Option’ or ‘Tainting’ resources in IaC tools like Terraform forces the destruction and recreation of ‘Snowflake Servers’ from the last known good configuration.
- Prioritizing CI/CD and IaC over manual intervention significantly reduces risk, improves recovery time, and maintains infrastructure consistency.
In this post, I explain why bypassing your CI/CD pipeline for a “quick” manual fix on production is the fastest way to break your infrastructure and how to handle deployment bottlenecks professionally.
When The Answer is a Hard “No”: The Dangers of Manual Prod Tweaks
I remember it like it was yesterday. It was 4:45 PM on a Friday, and a junior dev slid into my DMs with a request that made my eye twitch: “Hey Darian, the migration is hanging on prod-db-01. Can you just give me the SSH keys for five minutes so I can manually run the SQL? I’ll update the repo later, I promise.” My response was a “No” so fast it probably broke the sound barrier. It wasn’t because I’m a gatekeeper; it’s because I’ve spent too many Saturdays rebuilding “Snowflake Servers” that someone “fixed” manually, leaving the rest of the team in the dark when the next automated deployment inevitably failed.
The root cause here isn’t just a slow migration script; it’s a fundamental breakdown of the Source of Truth. When you touch production manually, you create “Configuration Drift.” Your code says one thing, but your actual server says another. Suddenly, your Terraform plans start failing, your automated rollbacks become unpredictable, and you’ve effectively blinded your monitoring tools. You’re no longer practicing DevOps; you’re just winging it in a high-stakes environment.
Pro Tip: If it isn’t in Git, it doesn’t exist. If you change it manually in the console or via SSH, you haven’t fixed the problem—you’ve just hidden it for the next person to find.
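If you want to catch drift before it bites you, one option is to let Terraform report it on a schedule. The sketch below is a minimal illustration, not a drop-in job: the function names `classify_plan_exit` and `check_drift` are hypothetical. It relies on the `-detailed-exitcode` flag of `terraform plan`, which exits 0 when there is nothing to change, 1 on error, and 2 when the plan contains changes. It assumes you run it from a checkout of what is already deployed, so a non-empty plan points at drift rather than unreleased code.

```shell
#!/usr/bin/env bash
# Sketch: report configuration drift using `terraform plan -detailed-exitcode`.
# Exit codes: 0 = no changes, 1 = error, 2 = changes present.

# Hypothetical helper: map a plan exit code to a short status word.
classify_plan_exit() {
  case "$1" in
    0) echo "in-sync" ;;
    2) echo "drift" ;;
    *) echo "error" ;;
  esac
}

# Hypothetical helper: run the plan in the current directory and report.
# Assumes the checkout matches what was last applied to production.
check_drift() {
  local code=0
  terraform plan -detailed-exitcode -input=false >/dev/null 2>&1 || code=$?
  classify_plan_exit "$code"
}
```

Calling `check_drift` from a nightly pipeline job and alerting on anything other than `in-sync` would surface a manual change within a day instead of at the next failed deployment.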
| Approach | Risk Level | Recovery Time |
| --- | --- | --- |
| Manual Hotfix | Extreme | Unpredictable |
| Pipeline Override | Medium | Fast (Automated) |
| Blue/Green Swap | Low | Instant |
Solution 1: The Quick Fix (The Pipeline Override)
If you’re in a genuine “Production Down” emergency, don’t bypass the pipeline—force it. Most CI/CD tools allow for environment variables or “skip” flags that can bypass lengthy test suites or non-critical checks while still logging the action. This ensures that even a “fast” fix is audited and the code is merged.
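# Example: Using an "emergency" flag in your Git commit to bypass non-critical CI stages
# Note: [skip-tests] and [emergency-deploy] are custom markers; your pipeline must be configured to recognize them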
git commit -m "HOTFIX: Increase db connection timeout [skip-tests] [emergency-deploy]"
git push origin main
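On the pipeline side, something has to read that commit message and act on it. Here is a rough sketch of what that could look like as a shell step. The helper names `has_flag` and `plan_stages` and the stage names are assumptions for illustration; they are not part of Git or any particular CI product, so adapt them to whatever your tool expects.

```shell
#!/usr/bin/env bash
# Sketch of the CI side: decide which stages to run from the commit message.
# The flag name mirrors the commit example; it is a team convention, not a built-in.

# Hypothetical helper: succeed if the message contains the literal flag text.
has_flag() {
  case "$1" in
    *"$2"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: print the stages to run, one per line.
plan_stages() {
  local msg="$1"
  echo "build"
  if has_flag "$msg" "[skip-tests]"; then
    # Leave a trace in the job log so the skip is auditable.
    echo "tests skipped by commit flag" >&2
  else
    echo "test"
  fi
  echo "deploy"
}
```

A job could call `plan_stages "$(git log -1 --pretty=%B)"` and run only the stages it prints. The important part is that the skip is written to the log, so the shortcut stays visible after the emergency is over.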
Solution 2: The Permanent Fix (Immutable Infrastructure)
Stop trying to fix servers while they are running. The better way is to move toward Immutable Infrastructure. If prod-db-01 is acting up, don’t log in to it. Instead, spin up a new instance (prod-db-02) with the correct configuration using your IaC (Infrastructure as Code) scripts, test it, and swap the traffic. If the new one fails, you still have the old one to fail back to.
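# Terraform snippet: when a change (such as a new AMI) requires replacement,
# create_before_destroy builds the new instance before destroying the old one.
# It does not force a replacement by itself.
resource "aws_instance" "prod_db" {
  ami           = "ami-0abcdef1234567890"
  instance_type = "t3.large"

  lifecycle {
    create_before_destroy = true
  }
}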
Solution 3: The “Nuclear” Option (The Forced Rebuild)
When the drift is so bad that you don’t even know what’s wrong anymore, it’s time for the Nuclear Option. We call this “Tainting” the resource. You tell your orchestration tool that the current resource is poisoned and must be destroyed and recreated from scratch based on the last known good configuration in your repository.
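# Forcing Terraform to kill the "Snowflake" and bring back a clean instance.
# `terraform taint` is deprecated since Terraform v0.15.2; on newer versions, use -replace instead.
# Drop -auto-approve if you want to review the plan before the instance is destroyed.
terraform apply -replace="aws_instance.prod_db_01" -auto-approve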
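If a blind `-auto-approve` on a production database feels too trigger-happy, a more cautious variant is to save the plan, look at it, and apply exactly that plan. The sketch below assumes Terraform v0.15.2 or later, where the `-replace` option is available. The function `rebuild_with_review`, the `TF_BIN` variable, and the `replace.tfplan` file name are hypothetical choices for this example.

```shell
#!/usr/bin/env bash
# Sketch: preview a forced rebuild, then apply exactly what was previewed.
# Assumes Terraform v0.15.2+ (for -replace). TF_BIN is a hypothetical override
# that lets you point at a wrapper script or a stub.
TF_BIN="${TF_BIN:-terraform}"

rebuild_with_review() {
  local addr="$1" plan_file="replace.tfplan" answer
  # Save the plan so the apply step cannot differ from what was reviewed.
  "$TF_BIN" plan -replace="$addr" -out="$plan_file" || return 1
  # Print the saved plan; it should show the resource being replaced.
  "$TF_BIN" show "$plan_file" || return 1
  printf 'Type the resource address to confirm the rebuild: '
  read -r answer || answer=""
  if [ "$answer" != "$addr" ]; then
    echo "aborted"
    return 1
  fi
  "$TF_BIN" apply "$plan_file"
}
```

Usage would be `rebuild_with_review aws_instance.prod_db_01`. Making the operator type the address back is a deliberate speed bump: it is slower than auto-approve by a few seconds, which is a fair price before destroying a production resource.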
Look, I get the pressure to move fast. But as a Senior Lead, my job is to make sure “fast” doesn’t become “catastrophic.” Next time you’re tempted to ask for that “quick SSH access,” take a breath and think about the pipeline instead. Your future self (and your sleep schedule) will thank you.
🤖 Frequently Asked Questions
❓ What is ‘Configuration Drift’ and why is it problematic?
‘Configuration Drift’ occurs when manual changes are made directly on production servers, causing their actual state to deviate from the version-controlled configuration (the ‘Source of Truth’). This leads to inconsistencies, unpredictable deployments, and makes automated rollbacks unreliable.
❓ How do different deployment approaches compare in terms of risk and recovery?
Manual Hotfixes carry Extreme risk with Unpredictable recovery. Pipeline Overrides have Medium risk with Fast (Automated) recovery. Blue/Green Swaps, often enabled by Immutable Infrastructure, offer Low risk with Instant recovery due to traffic shifting between identical environments.
❓ What is a common pitfall when facing a production emergency and how can it be avoided?
A common pitfall is the temptation to bypass the CI/CD pipeline for a ‘quick’ manual fix on production. This can be avoided by utilizing ‘Pipeline Overrides’ with emergency flags to push fixes through the pipeline, or by implementing ‘Immutable Infrastructure’ and IaC to replace faulty resources rather than modifying them in place.