🚀 Executive Summary

TL;DR: Small, recurring cloud charges often signal systemic governance flaws like resource sprawl and untagged assets, costing significant engineering time. The solution involves quick CLI-based detection, enforcing mandatory tagging via IAM policies, and implementing automated ‘janitor’ functions to prevent future orphaned resources.

🎯 Key Takeaways

  • Mystery cloud charges stem from resource sprawl, lack of ownership (untagged resources), and overly broad IAM permissions.
  • Utilize AWS CLI scripts to efficiently detect orphaned resources across all regions, such as unattached EBS volumes in the ‘available’ state.
  • Enforce mandatory tagging (e.g., “owner-email”, “project-name”) on resource creation using IAM policies or SCPs to prevent resource orphans and improve accountability.

Spent 28 minutes assembling evidence for a $42 chargeback today. At what point does this stop making sense?

Stop wasting hours tracking down tiny, recurring cloud charges. A senior DevOps engineer shares battle-tested strategies to find, fix, and prevent those mystery costs that bleed your budget and your time.

Hunting the $42 Ghost: When Cloud Cost Management Becomes a Time Sink

I remember it clearly. It was a Tuesday, and the monthly cloud bill had just dropped. My manager forwarded me an email from finance with a single line item highlighted: a recurring $12.72 charge for an ‘EBS gp2 Volume’ in ap-southeast-1. We didn’t have any production infrastructure in the Singapore region. For three months in a row, this tiny charge appeared, and for three months, someone spent an hour or two trying to find it before giving up. It became a joke, then an annoyance, and finally, my problem. That little $12 charge ended up costing us hundreds in engineering time. It wasn’t about the money; it was about the sloppiness it represented. We’ve all been there, staring at a bill, wondering if spending half a day hunting down a $42 charge is worth it. The answer is yes, but only if you use the opportunity to fix the systemic problem.

The “Why”: How These Ghosts Get in the Machine

These mystery charges aren’t bugs in the cloud provider’s billing system. They’re symptoms of a deeper issue. It usually boils down to a few key things:

  • Resource Sprawl: A junior engineer spins up a test EC2 instance, attaches a volume, terminates the instance but forgets the volume. It happens in a non-standard region they were testing for latency, and it’s forgotten forever.
  • Lack of Ownership: The resource has no tags. No owner tag, no project tag, no ttl tag. It’s an orphan, and nobody wants to delete it for fear of breaking something they don’t understand.
  • Permissive Permissions: Developers have overly broad IAM permissions, allowing them to create resources anywhere, anytime, without guardrails.

Fighting this isn’t just about deleting a resource; it’s about building a system where these ghosts can’t hide in the first place.

Solution 1: The Quick Fix – “The Cloud Detective Kit”

You need to find the thing, now. Your goal is to stop the bleeding. This is the manual, brute-force method, but you need to know how to do it.

For our phantom EBS volume, the first step was a simple script to iterate through every region. Don’t rely on the console; it’s too slow and you’ll miss something. Get comfortable with the CLI.

Example: Finding Unattached EBS Volumes with the AWS CLI

This is a quick and dirty bash loop I’ve used more times than I can count. It iterates over all enabled regions and lists any EBS volumes in the ‘available’ state (meaning, not attached to an instance).

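# List every region enabled for the account, then check each one.
for region in $(aws ec2 describe-regions --query "Regions[].RegionName" --output text); do
  echo "Checking region: $region"
  # status=available means the volume exists but is not attached to any instance.
  aws ec2 describe-volumes --region "$region" --filters Name=status,Values=available --query "Volumes[].{ID:VolumeId, Size:Size, Created:CreateTime}" --output table
done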

Once you find the Volume ID (e.g., vol-012345abcdef6789), you can inspect it, snapshot it just in case, and then delete it. The volume stops accruing charges as soon as it is deleted, so after any prorated remainder the line item drops off the bill. You’ve won the battle, but not the war.
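The snapshot-then-delete step can be sketched as a small bash function. This is a minimal sketch, not a hardened tool: the function name retire_volume and the DRY_RUN variable are hypothetical, and it assumes you already have the volume ID and region from the loop above. Only the aws ec2 subcommands are real CLI calls.

```shell
# Hypothetical helper: snapshot an unattached EBS volume, then delete it.
# DRY_RUN defaults to 1, which only prints what would happen.
retire_volume() {
  local volume_id="$1" region="$2"

  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "DRY RUN: would snapshot $volume_id in $region, then delete it"
    return 0
  fi

  # Take a safety snapshot and capture its ID.
  local snapshot_id
  snapshot_id=$(aws ec2 create-snapshot --region "$region" \
    --volume-id "$volume_id" \
    --description "Safety snapshot before deleting $volume_id" \
    --query SnapshotId --output text) || return 1

  # Block until the snapshot is complete, so the data is safe before deletion.
  aws ec2 wait snapshot-completed --region "$region" \
    --snapshot-ids "$snapshot_id" || return 1

  # Only now remove the volume itself.
  aws ec2 delete-volume --region "$region" --volume-id "$volume_id"
}

# Usage: preview first, then run for real.
# retire_volume vol-012345abcdef6789 ap-southeast-1
# DRY_RUN=0 retire_volume vol-012345abcdef6789 ap-southeast-1
```

Keep in mind that the snapshot is billed too, at snapshot storage rates, so give it an owner tag and an expiry date or it becomes the next ghost.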

Solution 2: The Permanent Fix – “Building the Fence”

Okay, you’ve killed the ghost. Now you need to make sure its friends can’t get in. This is about policy and automation. You need to make it easy for developers to do the right thing and hard to do the wrong thing.

Enforce Tagging on Creation

The single most effective thing you can do is enforce tagging. If a resource can’t be created without an owner-email and project-name tag, every new resource arrives with someone to ask about it, and true orphans become rare. You can do this with Service Control Policies (SCPs) at the AWS Organization level or with IAM policies.

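Here’s an example IAM policy that forces a user to add those two tags when creating an EC2 instance. Each tag gets its own Deny statement because keys listed together under one condition operator are ANDed, so a single combined statement would only deny a launch that is missing both tags. The ec2:CreateTags statement lets users apply tags as part of the launch call and nothing else:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowInstanceCreationWithTags",
            "Effect": "Allow",
            "Action": "ec2:RunInstances",
            "Resource": "*"
        },
        {
            "Sid": "AllowTaggingOnCreation",
            "Effect": "Allow",
            "Action": "ec2:CreateTags",
            "Resource": "arn:aws:ec2:*:*:*/*",
            "Condition": {
                "StringEquals": {
                    "ec2:CreateAction": "RunInstances"
                }
            }
        },
        {
            "Sid": "DenyCreationWithoutOwnerTag",
            "Effect": "Deny",
            "Action": "ec2:RunInstances",
            "Resource": "arn:aws:ec2:*:*:instance/*",
            "Condition": {
                "Null": {
                    "aws:RequestTag/owner-email": "true"
                }
            }
        },
        {
            "Sid": "DenyCreationWithoutProjectTag",
            "Effect": "Deny",
            "Action": "ec2:RunInstances",
            "Resource": "arn:aws:ec2:*:*:instance/*",
            "Condition": {
                "Null": {
                    "aws:RequestTag/project-name": "true"
                }
            }
        }
    ]
}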
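To check a policy like this from a developer’s point of view, it helps to see what a compliant launch looks like. The sketch below builds the --tag-specifications value for the two required tags. The helper name instance_tag_spec, the email address, the project name, and the AMI ID are all placeholders, and the commented run-instances call assumes the --dry-run permission check reflects the tag conditions in your account, which is worth confirming yourself.

```shell
# Hypothetical helper: build the --tag-specifications value that carries
# the two mandatory tags (owner-email, project-name) for an instance launch.
instance_tag_spec() {
  local owner="$1" project="$2"
  printf 'ResourceType=instance,Tags=[{Key=owner-email,Value=%s},{Key=project-name,Value=%s}]' \
    "$owner" "$project"
}

# Example call (placeholder AMI ID). --dry-run asks EC2 to check permissions
# without actually launching anything:
#
# aws ec2 run-instances --dry-run \
#   --image-id ami-EXAMPLE --instance-type t3.micro \
#   --tag-specifications "$(instance_tag_spec dev@example.com billing-api)"
```

A launch that includes both tags should pass the permission check, while the same command without --tag-specifications should be refused. Note that the Deny statements only cover the instance resource; volumes created alongside it would need their own ResourceType=volume tag specification and matching policy statements.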

Pro Tip: Your most valuable resource is your engineers’ time. Spending a day setting up a good tagging policy will save you hundreds of hours of detective work over the next year. Don’t spend $500 of engineering time to claw back a one-off $50 charge.
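The trade-off is easy to put a number on. As a rough sketch, the function below converts investigation time into dollars so it can be compared with the charge being chased. The function name and the $100 hourly rate are assumptions for illustration only; substitute your own loaded engineering cost.

```shell
# Hypothetical helper: cost of an investigation in dollars.
# Arguments: minutes spent, hourly rate in dollars (an assumed figure).
investigation_cost() {
  awk -v minutes="$1" -v rate="$2" 'BEGIN { printf "%.2f\n", minutes / 60 * rate }'
}

# The 28 minutes spent on the $42 chargeback, at an assumed $100/hour:
investigation_cost 28 100   # prints 46.67
```

At that assumed rate, the 28-minute investigation already costs more than the $42 it recovers, which is why the time is only well spent if it also produces a policy or automation fix.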

Solution 3: The ‘Nuclear’ Option – “Automated Janitors”

Sometimes, even with policies, things slip through. Or maybe you have a huge, messy environment that’s too big to fix manually. It’s time to bring in the bots.

This is where you write automation—usually a Lambda function triggered on a schedule (e.g., every night)—that actively hunts for and terminates non-compliant resources.

| The Strategy | The Implementation |
| --- | --- |
| Untagged Resource Hunter | A Lambda function scans for resources (like EC2, RDS, EBS) that are missing a mandatory ttl (time-to-live) or owner tag. It first adds a ‘pending-deletion’ tag and notifies the owner (if possible) or a Slack channel. |
| The Reaper | A second Lambda runs 24 hours later. It looks for anything with the ‘pending-deletion’ tag and terminates it. This gives someone a grace period to claim their resource and make it compliant. |
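The mark-then-reap handoff can be sketched in bash, even though the real thing would typically live in a Lambda. This is one possible design, not the only one: it assumes the Hunter stores a Unix timestamp as the value of the ‘pending-deletion’ tag so the Reaper can check the grace period itself. The function names and the GRACE_SECONDS variable are hypothetical; aws ec2 create-tags is the real CLI call.

```shell
# Grace period between marking and reaping: 24 hours, in seconds.
GRACE_SECONDS=$((24 * 60 * 60))

# Hunter side (hypothetical helper): mark a resource, recording when it
# was marked as a Unix timestamp in the tag value.
mark_pending() {
  local resource_id="$1" region="$2"
  aws ec2 create-tags --region "$region" --resources "$resource_id" \
    --tags "Key=pending-deletion,Value=$(date +%s)"
}

# Reaper side (hypothetical helper): given the timestamp read back from the
# tag and the current time, print "reap" once the grace period has elapsed,
# otherwise "wait".
reaper_decision() {
  local marked_at="$1" now="$2"
  if [ $((now - marked_at)) -ge "$GRACE_SECONDS" ]; then
    echo reap
  else
    echo wait
  fi
}

# Usage inside the Reaper:
# reaper_decision "$tag_value" "$(date +%s)"
```

Storing the timestamp in the tag means the grace period holds even if the Reaper’s schedule drifts or it runs more than once a day, since the decision no longer depends on the two jobs being exactly 24 hours apart.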

Warning: Be extremely careful here. Start in a development account. Run in “dry-run” mode for a week, logging what you would have deleted. Build in allow-lists for critical infrastructure that might not have tags (e.g., legacy systems). You do not want to be the engineer who writes a script that deletes prod-db-01.
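The dry-run and allow-list safeguards can be sketched as a planning step that runs before any deletion call. Everything here is illustrative: the resource IDs in ALLOW_LIST are made up, and the names is_allow_listed, plan_deletions, and DRY_RUN are hypothetical. The sketch deliberately prints a plan instead of calling AWS, so it is safe to run as written.

```shell
# Hypothetical allow-list: one resource ID per line. In practice this might
# come from a file or a parameter store instead of being hard-coded.
ALLOW_LIST="vol-0aaaaaaaaaaaaaaaa
i-0bbbbbbbbbbbbbbbb"

# Succeeds if the given ID matches an allow-list line exactly.
is_allow_listed() {
  printf '%s\n' "$ALLOW_LIST" | grep -qxF -- "$1"
}

# Reads candidate resource IDs on stdin and prints one planned action per ID.
# DRY_RUN defaults to 1, so nothing is ever marked DELETE unless it is
# switched off explicitly.
plan_deletions() {
  local id
  while IFS= read -r id; do
    [ -n "$id" ] || continue
    if is_allow_listed "$id"; then
      echo "SKIP $id (allow-listed)"
    elif [ "${DRY_RUN:-1}" = "1" ]; then
      echo "WOULD-DELETE $id"
    else
      echo "DELETE $id"
    fi
  done
}

# Usage: log the plan for a week before trusting it.
# printf '%s\n' vol-0aaaaaaaaaaaaaaaa vol-0cccccccccccccccc | plan_deletions
```

Keeping the plan separate from the deletion calls makes the dry-run week cheap: the logged WOULD-DELETE lines are exactly what a live run would act on, and anything surprising in them can be added to the allow-list first.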

Ultimately, that $42 charge isn’t the problem. It’s a health check for your cloud governance. If you’re spending 30 minutes chasing it, you’re not just wasting time; you’re discovering a flaw in your process. Fix the process, and the ghosts will stop appearing.

Darian Vance - Lead Cloud Architect & DevOps Strategist

With over 12 years in system architecture and automation, Darian specializes in simplifying complex cloud infrastructures. An advocate for open-source solutions, he founded TechResolve to provide engineers with actionable, battle-tested troubleshooting guides and robust software alternatives.


🤖 Frequently Asked Questions

❓ What are the primary causes of unexpected cloud charges?

Unexpected cloud charges typically arise from resource sprawl (forgotten test instances/volumes), lack of resource ownership due to missing tags, and overly permissive IAM policies allowing unmonitored resource creation.

❓ How do manual chargeback investigations differ from automated cloud cost governance?

Manual investigations are reactive, time-consuming efforts to find specific charges. Automated governance, through enforced tagging and ‘janitor’ bots, offers a proactive, systemic approach to prevent resource sprawl and ensure cost accountability.

❓ What is a critical risk when deploying automated resource termination bots?

A critical risk is inadvertently deleting production resources. Mitigate this by starting in development environments, using ‘dry-run’ modes, implementing allow-lists for critical infrastructure, and providing grace periods with notifications.
