🚀 Executive Summary

TL;DR: Organizations annually waste millions on forgotten SaaS subscriptions and zombie cloud infrastructure due to rapid development and lack of cleanup ownership. This waste can be stopped by implementing a multi-tiered strategy, starting with manual audits, progressing to automated policy-as-code enforcement and ‘Cloud Janitor’ functions, and ultimately utilizing ephemeral sandbox accounts for non-production environments.

🎯 Key Takeaways

  • Untagged resources are a primary indicator of waste; use AWS CLI queries to identify resources missing critical tags like `owner-email` or `project-code` for manual ‘Tag & Bag’ audits.
  • Automate resource lifecycle management using ‘Cloud Janitor’ functions (e.g., AWS Lambda) that scan for `ttl-days` or `creation-date` tags, sending warnings or automatically stopping/deleting expired resources, enforced by policies like AWS Service Control Policies (SCPs).
  • Implement ‘Ephemeral Sandbox Accounts’ for non-production environments, programmatically wiping them clean on a regular schedule (e.g., every 90 days) to force teams into adopting Infrastructure as Code (IaC) and prevent cruft accumulation.


$21 Million Down the Drain: A DevOps War Story on Ghost SaaS & Cloud Waste

I still remember the meeting. Our finance lead, looking pale, pulled up a slide showing a $15,000 monthly charge from a log analytics SaaS provider we supposedly “migrated away from” six months prior. It turned out a single, forgotten log forwarder on an old bastion host, `bastion-prod-01`, was still dutifully shipping terabytes of data into the void, and our credit card was dutifully paying for it. No one noticed until the bill hit a critical threshold. This isn’t a rare story; it’s a rite of passage in the cloud world. That Reddit thread about $21 million in waste didn’t surprise me one bit. We, the engineers, create this mess, so it’s on us to fix it.

The “Why”: How We Get Here in the First Place

This isn’t about incompetence; it’s about entropy. The “move fast and break things” culture has a dark side: “move fast and forget things.” A developer spins up a proof-of-concept with a shiny new SaaS tool using a corporate card. The PoC fails, the dev moves to a new project, and the subscription auto-renews forever. A team provisions a massive Kubernetes cluster for a project that gets de-prioritized, but no one ever runs `terraform destroy`. There’s no single owner for the “clean-up” phase of the project lifecycle. This is death by a thousand papercuts, and it adds up to millions.

The Fixes: From Duct Tape to Fort Knox

Look, there’s no magic bullet. But there are battle-tested strategies. I’m going to walk you through three levels of solutions, from the quick-and-dirty fix you can do this afternoon to the long-term architectural changes that will make your CFO love you.

1. The Quick Fix: The ‘Tag & Bag’ Audit

This is the manual, brute-force, “we’re bleeding money now” approach. It’s hacky, but it works. The goal is to hunt down resources that have no clear owner or purpose. In the cloud, our best weapon for this is tagging.

Your mission is to find anything that isn’t tagged correctly. I’m talking about EC2 instances, RDS databases, S3 buckets, you name it. A simple starting point is to look for resources missing a crucial tag like `owner-email` or `project-code`.

Here’s a simple AWS CLI one-liner I’ve used countless times to find untagged EC2 instances. It’s ugly, but it gets the job done.

```shell
aws ec2 describe-instances \
  --query 'Reservations[*].Instances[*].[InstanceId, Tags[?Key==`owner-email`].Value | [0]]' \
  --output text | grep "None$"
```

This spits out a list of instance IDs that are missing the `owner-email` tag. Now you can play detective. Check the instance name, security groups, and creation date to figure out what it is and who might own it. For SaaS, you have to go to your finance or procurement department and get a list of all recurring software charges. Then, cross-reference that with your known, actively used tools. What’s left over is your investigation list.
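The one-liner above only covers EC2. For a broader sweep, the Resource Groups Tagging API can list ARNs across every taggable service. Here’s a sketch in Python with boto3 (assuming your AWS credentials are already configured; the `has_tag` helper is the only pure-logic part):

```python
REQUIRED_TAG = "owner-email"

def has_tag(tags, key):
    """Return True if the tag list (as returned by the Tagging API) contains `key`."""
    return any(t.get("Key") == key for t in tags)

def find_untagged(required=REQUIRED_TAG):
    """Yield ARNs of resources missing the required tag, across all taggable services."""
    import boto3  # imported here so the helper above stays usable without AWS installed
    client = boto3.client("resourcegroupstaggingapi")
    paginator = client.get_paginator("get_resources")
    for page in paginator.paginate():
        for mapping in page["ResourceTagMappingList"]:
            if not has_tag(mapping.get("Tags", []), required):
                yield mapping["ResourceARN"]

if __name__ == "__main__":
    for arn in find_untagged():
        print(arn)
```

Run it once per required tag, or extend `has_tag` to take a set of keys. Pagination matters here; in any account old enough to have this problem, a single `get_resources` call won’t return everything.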

2. The Permanent Fix: The ‘Cloud Janitor’ & Policy as Code

After you’ve stopped the immediate bleeding, you need to prevent it from happening again. This is where automation and policy become your best friends. The goal is to make it impossible to create untracked resources and to automatically clean up old ones.

Step 1: Enforce Tagging. Use something like AWS Service Control Policies (SCPs) to deny the creation of resources (like EC2, RDS) if they don’t have specific tags (e.g., `owner-email`, `project-code`, `ttl-days`).
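As a sketch, a minimal SCP that denies launching EC2 instances unless an `owner-email` tag is supplied at creation could look like this (a real policy would cover more actions and require the other tags too; the `aws:RequestTag` condition key is what does the work):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyEC2WithoutOwnerTag",
      "Effect": "Deny",
      "Action": "ec2:RunInstances",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "Null": { "aws:RequestTag/owner-email": "true" }
      }
    }
  ]
}
```

The `Null` condition evaluates to true when the tag is absent from the request, which is exactly when you want the deny to fire.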

Step 2: Automate The Cleanup. This is my favorite part. We build a “Cloud Janitor.” This is typically a set of Lambda functions that run on a schedule (e.g., nightly). It scans all resources for a `ttl-days` (Time To Live) tag or a `creation-date` tag.

  • If a resource is N days away from its expiration, the janitor sends a warning email or a Slack message to the address in the `owner-email` tag.
  • If the resource is past its expiration date, the janitor automatically stops it (for VMs) or snapshots and deletes it (for databases).
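The decision logic at the heart of a janitor like this is simple enough to sketch. Here’s a hypothetical Python version of the tag-evaluation step (the `WARN_DAYS` constant and function name are mine; the tag names match the scheme above). The actual stop/snapshot/delete calls would be boto3 calls gated behind a dry-run flag:

```python
from datetime import date, timedelta

WARN_DAYS = 3  # how far ahead of expiry to start nagging the owner

def janitor_verdict(tags, today=None):
    """Decide what to do with a resource based on its `creation-date` and
    `ttl-days` tags. Returns 'keep', 'warn', or 'expire'.
    Resources missing either tag are left alone ('keep') -- making the tags
    mandatory in the first place is the SCP's job, not the janitor's."""
    today = today or date.today()
    try:
        created = date.fromisoformat(tags["creation-date"])
        ttl = int(tags["ttl-days"])
    except (KeyError, ValueError):
        return "keep"
    expires = created + timedelta(days=ttl)
    if today >= expires:
        return "expire"
    if (expires - today).days <= WARN_DAYS:
        return "warn"
    return "keep"
```

Keeping this as a pure function makes it trivial to unit-test against fat-fingered tag values before you ever wire it to anything destructive.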

Pro Tip: Start your Cloud Janitor in “dry run” mode first! Have it only log what it would terminate. Trust me, you don’t want to accidentally delete `prod-db-01` because someone fat-fingered a tag.

3. The ‘Nuclear’ Option: Ephemeral Sandbox Accounts

This is my slightly controversial, but incredibly effective, solution for non-production environments. The problem with dev/staging/testing environments is that they become a graveyard of half-finished experiments. The solution? Burn the whole graveyard to the ground on a regular schedule.

Here’s the strategy: You give teams dedicated AWS accounts for development and experimentation. The catch? These accounts are programmatically wiped clean every 90 days. Everything is deleted.
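If you’d rather not write the wipe logic yourself, the open-source aws-nuke tool does exactly this: it deletes everything in an account except what you explicitly filter out. A rough sketch of its config (the account IDs are placeholders, and the schema varies between versions, so treat this as an illustration and check the project’s docs):

```yaml
regions:
  - us-east-1
  - global

account-blocklist:
  - "111111111111"        # production account -- never, ever nuked

accounts:
  "222222222222":         # team sandbox account
    filters:
      IAMRole:
        - "OrganizationAccountAccessRole"  # keep the role used to administer the account
```

The blocklist is non-negotiable: the tool refuses to run without one, which is precisely the kind of guardrail you want here.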

The first time you do this, people will scream. They’ll have lost their precious, manually-configured EC2 instance. But the second time, they’ll have learned a valuable lesson: if it’s not in Terraform (or your IaC tool of choice), it doesn’t exist.

This approach forces good behavior. It ensures all infrastructure is codified, repeatable, and can be spun up from scratch. It makes cruft accumulation literally impossible.

Warning: I cannot stress this enough. This is for NON-PRODUCTION accounts only. Applying this to a production account is what we call a “resume-generating event.”

Comparison of Solutions

| Solution | Effort to Implement | Effectiveness | Risk Level |
| --- | --- | --- | --- |
| 1. Tag & Bag Audit | Low (hours) | Medium (one-time fix) | Low |
| 2. Cloud Janitor | Medium (weeks) | High (long-term prevention) | Medium |
| 3. Nuclear Option | High (requires buy-in) | Very high (eliminates cruft) | High (if misconfigured) |

There’s no excuse for letting millions of dollars evaporate into the ether of unused cloud resources and SaaS subscriptions. It starts with a manual audit, but it has to end with automation. Pick a strategy, get started, and stop the bleeding.


Darian Vance

Lead Cloud Architect & DevOps Strategist

With over 12 years in system architecture and automation, Darian specializes in simplifying complex cloud infrastructures. An advocate for open-source solutions, he founded TechResolve to provide engineers with actionable, battle-tested troubleshooting guides and robust software alternatives.


🤖 Frequently Asked Questions

❓ How can organizations identify and mitigate ‘ghost SaaS’ and ‘zombie cloud infrastructure’ waste?

Organizations should start with a ‘Tag & Bag’ audit to manually identify untagged cloud resources and cross-reference finance records for unused recurring SaaS charges. Subsequently, implement automated ‘Cloud Janitor’ systems with policy-as-code to enforce tagging and scheduled cleanup, and consider ‘Ephemeral Sandbox Accounts’ for non-production environments to prevent cruft.

❓ How do the ‘Tag & Bag Audit,’ ‘Cloud Janitor,’ and ‘Ephemeral Sandbox Accounts’ approaches compare in terms of effort, effectiveness, and risk?

The ‘Tag & Bag Audit’ is low effort, medium effectiveness (one-time fix), and low risk. The ‘Cloud Janitor’ is medium effort, high effectiveness (long-term prevention), and medium risk. The ‘Nuclear Option’ (Ephemeral Sandbox Accounts) is high effort (requires buy-in), very high effectiveness (eliminates cruft), and high risk if misconfigured, strictly for non-production.

❓ What is a common implementation pitfall when automating cloud resource cleanup, and how can it be avoided?

A common pitfall is accidentally terminating critical production resources. This can be avoided by initially running automated cleanup tools like the ‘Cloud Janitor’ in ‘dry run’ mode to only log actions without executing them, and by strictly limiting ‘Ephemeral Sandbox Accounts’ to non-production environments only.
