🚀 Executive Summary

TL;DR: Cloud cost optimization often defaults to reactive, manual deletion of orphaned snapshots due to a lack of ownership and automation. Effective strategies involve immediate cleanup scripts, implementing automated lifecycle management and tagging policies, and fostering a cultural shift towards shared FinOps responsibility to prevent waste proactively.

🎯 Key Takeaways

  • A Python script using Boto3 can be used for immediate triage to identify and list old, untagged snapshots (e.g., older than 90 days without a ‘Project’ tag) for cleanup.
  • Permanent fixes involve enforcing tagging policies (owner, project, ttl) via IAM/SCP, utilizing AWS Data Lifecycle Manager (DLM) for scheduled backups, and deploying a ‘Reaper’ Lambda function to automatically delete snapshots based on their ‘ttl’ tag.
  • The most impactful solution is a FinOps culture shift, promoting shared responsibility by using cost allocation tags, providing cost visibility to teams, and integrating cost accountability into development processes.

Tired of “cloud cost optimization” just being a frantic snapshot cleanup? A senior DevOps lead explains why this happens and provides three real-world solutions, from quick scripts to permanent cultural fixes.

Is “Cloud Cost Optimization” Just a Lazy Game of Deleting Old Snapshots?

I still remember the Monday morning meeting from a few years back. Our cloud bill had jumped by $8,000 that month, and the finance team wanted answers. I spent the next four hours digging through Cost Explorer, and the culprit was as mundane as it was infuriating: thousands of orphaned EBS snapshots. They came from a frantic migration project six months prior. The engineers who created them had since moved to other teams, leaving these tiny, accumulating costs behind like digital breadcrumbs. It felt less like engineering and more like digital archaeology. The Reddit thread I saw last week brought that morning right back, and it hit the nail on the head: for too many teams, “cost optimization” has become a reactive, soul-crushing game of Whac-A-Mole with old snapshots.

So, Why Does This Keep Happening?

It’s simple: a fundamental lack of ownership and automation. Snapshots are the ultimate “fire-and-forget” resource. A developer is about to do something risky to prod-db-01. What’s the first thing they do? “I’ll just take a quick snapshot, just in case.” That “just in case” moment is where the problem starts. There’s rarely a ticket, rarely a formal process, and certainly no automated cleanup schedule attached to that one-off action. It’s a cheap insurance policy in the moment, but hundreds of these policies, forgotten over months, add up to a significant premium on your bill.

The root cause isn’t malice; it’s friction. The path of least resistance is to create the snapshot and move on. Creating a lifecycle policy, setting a reminder, or adding tags feels like extra work when you’re under pressure to ship a feature. So the snapshots sit there, detached from their original purpose, their source volume or AMI, and the person who created them.

Three Levels of Fighting Back

You can’t just tell people to “be better.” You need to implement systems. Here’s my playbook, from the immediate triage to the long-term cure.

1. The Triage: The “Stop the Bleeding” Script

This is the quick and dirty fix. You’ve got a budget meeting tomorrow, and you need to show you’re taking action. You write a script to find the most obvious offenders: old, untagged snapshots.

Pro Tip: Always run these scripts in dry-run mode first! You do not want to be the person who accidentally deletes the only backup of a critical pre-production database because your script was too aggressive.

Here’s a basic Python script using Boto3 for AWS that I’ve used a dozen times. It looks for snapshots older than 90 days that don’t have a ‘Project’ tag.


import boto3
from datetime import datetime, timedelta, timezone

def find_old_untagged_snapshots():
    ec2 = boto3.client('ec2')
    ninety_days_ago = datetime.now(timezone.utc) - timedelta(days=90)
    
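    # Get snapshots owned by you. Paginate so that accounts with thousands of
    # snapshots are read in manageable pages rather than one huge response.
    paginator = ec2.get_paginator('describe_snapshots')
    snapshots = (
        snapshot
        for page in paginator.paginate(OwnerIds=['self'])
        for snapshot in page['Snapshots']
    )

    candidates_for_deletion = []

    for snapshot in snapshots: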
        snapshot_time = snapshot['StartTime']
        
        # Check if the snapshot is older than 90 days
        if snapshot_time < ninety_days_ago:
            tags = {tag['Key']: tag['Value'] for tag in snapshot.get('Tags', [])}
            
            # Check if the 'Project' tag is missing
            if 'Project' not in tags:
                candidates_for_deletion.append({
                    'SnapshotId': snapshot['SnapshotId'],
                    'StartTime': snapshot_time.strftime('%Y-%m-%d'),
                    'Size': snapshot['VolumeSize']
                })

    print("Found the following snapshots to review for deletion:")
    for cand in candidates_for_deletion:
        print(f"  - ID: {cand['SnapshotId']}, Created: {cand['StartTime']}, Size: {cand['Size']}GB")
        
    # To actually delete, you would uncomment the following lines:
    # for cand in candidates_for_deletion:
    #     print(f"Deleting {cand['SnapshotId']}...")
    #     ec2.delete_snapshot(SnapshotId=cand['SnapshotId'])

if __name__ == '__main__':
    find_old_untagged_snapshots()

This is a band-aid, but it's an effective one. It lets you clean up the existing mess and quantify the scale of the problem.
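For the deletion step itself, one hedged way to make dry-run the default is EC2's own DryRun flag. The sketch below is an illustration, not part of the script above: delete_snapshots is a hypothetical helper that takes a Boto3 EC2 client and a list of snapshot IDs. Keep in mind that DryRun only confirms you have permission to make the call; it won't tell you whether a snapshot is still backing an AMI.

```python
def delete_snapshots(ec2, snapshot_ids, dry_run=True):
    """Delete snapshots, or with dry_run=True only ask EC2 whether each call is permitted.

    `ec2` is expected to be a Boto3 EC2 client. Returns {snapshot_id: status}.
    """
    results = {}
    for snapshot_id in snapshot_ids:
        try:
            ec2.delete_snapshot(SnapshotId=snapshot_id, DryRun=dry_run)
            results[snapshot_id] = 'deleted'
        except Exception as exc:
            # Boto3 raises botocore's ClientError, which carries a `response` dict.
            # Matching on that shape (an assumption of this sketch) keeps the helper
            # importable without Boto3; anything else is re-raised untouched.
            response = getattr(exc, 'response', None)
            if response is None:
                raise
            code = response.get('Error', {}).get('Code', 'Unknown')
            if code == 'DryRunOperation':
                # EC2 reports a permitted dry-run request as this "error".
                results[snapshot_id] = 'would delete'
            else:
                results[snapshot_id] = f'failed: {code}'
    return results
```

Calling delete_snapshots(boto3.client('ec2'), ids) only reports what would happen; pass dry_run=False after you've reviewed the list with the snapshot owners.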

2. The Permanent Fix: Automation and Guardrails

Running scripts manually isn't a strategy; it's a chore. The real fix is to automate the process and make it impossible (or at least very difficult) to create these orphaned resources in the first place.

  • Enforce a Tagging Policy: This is non-negotiable. Every snapshot must have a minimum set of tags: owner, project, and ttl (time-to-live, e.g., '7d' for 7 days). Tag keys are case-sensitive, so pick one casing and use it everywhere. You can enforce this using IAM policies or AWS Service Control Policies (SCPs) that deny the ec2:CreateSnapshot and ec2:CreateSnapshots actions if the required tags aren't present.
  • Use AWS Data Lifecycle Manager (DLM): For any snapshots that are part of a regular backup schedule (e.g., daily snaps of prod-web-app-01), stop creating them manually or with cron jobs. Use DLM. You can define policies that say "create a snapshot every 24 hours, keep 7 daily snapshots, 4 weekly, and 1 monthly." It automates both creation and deletion.
  • The "Reaper" Lambda Function: For the one-off snapshots, you need an automated janitor. Create a simple Lambda function triggered on a schedule by Amazon EventBridge (formerly CloudWatch Events), e.g., daily at midnight. The function scans all snapshots, reads each one's ttl tag, and deletes any snapshot older than its TTL. No humans required.
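Here's a minimal sketch of what such a Reaper could look like. Treat it as a starting point under stated assumptions, not a finished implementation: the ttl format (an integer plus h, d, or w), the helper names parse_ttl and is_expired, and the choice to leave snapshots with a missing or malformed ttl alone are all mine.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed ttl format: an integer plus a unit, e.g. '12h', '7d', '2w'.
_TTL_PATTERN = re.compile(r'^([0-9]{1,4})([hdw])$')
_UNIT_TO_KWARG = {'h': 'hours', 'd': 'days', 'w': 'weeks'}


def parse_ttl(value):
    """Turn a ttl tag such as '7d' into a timedelta, or None if it is malformed."""
    match = _TTL_PATTERN.match(value.strip().lower())
    if match is None:
        return None
    amount, unit = match.groups()
    return timedelta(**{_UNIT_TO_KWARG[unit]: int(amount)})


def is_expired(snapshot, now):
    """True only when the snapshot has a valid ttl tag and has outlived it."""
    tags = {tag['Key']: tag['Value'] for tag in snapshot.get('Tags', [])}
    ttl = parse_ttl(tags.get('ttl', ''))
    if ttl is None:
        return False  # missing or unparseable ttl: leave it for a human
    return snapshot['StartTime'] + ttl < now


def lambda_handler(event, context):
    """Scheduled entry point: delete every snapshot that has outlived its ttl tag."""
    # Imported here so the pure helpers above can be exercised without AWS access.
    import boto3
    from botocore.exceptions import ClientError

    ec2 = boto3.client('ec2')
    now = datetime.now(timezone.utc)
    deleted, skipped = [], []
    paginator = ec2.get_paginator('describe_snapshots')
    pages = paginator.paginate(
        OwnerIds=['self'],
        Filters=[{'Name': 'tag-key', 'Values': ['ttl']}],  # only snapshots carrying a ttl tag
    )
    for page in pages:
        for snapshot in page['Snapshots']:
            if not is_expired(snapshot, now):
                continue
            try:
                ec2.delete_snapshot(SnapshotId=snapshot['SnapshotId'])
                deleted.append(snapshot['SnapshotId'])
            except ClientError as err:
                # e.g. the snapshot still backs a registered AMI; record it and move on
                skipped.append({
                    'SnapshotId': snapshot['SnapshotId'],
                    'Error': err.response['Error']['Code'],
                })
    print(f"Deleted {len(deleted)} snapshots, skipped {len(skipped)}")
    return {'deleted': deleted, 'skipped': skipped}
```

Keeping the expiry check in plain functions means you can unit test the TTL logic locally before the Lambda is ever allowed to delete anything.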

3. The Culture Shift: Shared Responsibility (The FinOps Way)

This is the hardest but most impactful solution. The problem isn't really a technical one; it's an organizational one. As long as developers can create resources and DevOps/Finance is responsible for the cost, the cycle will continue.

Warning: This requires buy-in from leadership. You can't just declare this on your own. It's a significant change in how teams operate.

The goal is to shift the cost ownership to the people creating the resources.

  • Visibility: Use cost allocation tags (`project`, `team`) for everything. Create dashboards (in your cloud provider's console or a third-party tool) that show each team their specific contribution to the cloud bill. When a developer sees that their un-deleted test environment is costing their team $50 a day, they're much more likely to clean it up.
  • Accountability: Make cloud cost a part of the conversation during sprint planning and reviews. It should be treated as a non-functional requirement, just like performance or security. Ask the question: "What is the full lifecycle cost of this feature?"
  • Empowerment: Give developers the tools and permissions they need to manage their own resources and clean up after themselves. The "Reaper" Lambda from the previous step is a great safety net, but the primary responsibility should lie with the resource owner.
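To make the visibility point concrete, here's a minimal sketch of a per-team rollup. It assumes snapshot dicts shaped like Boto3's describe_snapshots output and a `team` tag; snapshot_gb_by_tag, format_report, and the price constant are hypothetical names of mine. VolumeSize is the size of the source volume, not the incremental data you're actually billed for, so the dollar figure is an upper bound, and the $0.05 per GB-month rate is an assumption you should replace with your region's pricing.

```python
from collections import defaultdict

# Assumption: an illustrative standard-tier snapshot rate; check your region's pricing.
ASSUMED_USD_PER_GB_MONTH = 0.05


def snapshot_gb_by_tag(snapshots, tag_key='team'):
    """Sum source-volume GB per value of `tag_key`; untagged snapshots are grouped together."""
    totals = defaultdict(int)
    for snapshot in snapshots:
        tags = {tag['Key']: tag['Value'] for tag in snapshot.get('Tags', [])}
        totals[tags.get(tag_key, 'untagged')] += snapshot['VolumeSize']
    return dict(totals)


def format_report(totals, usd_per_gb_month=ASSUMED_USD_PER_GB_MONTH):
    """Return one line per owner, largest first, with an upper-bound monthly cost."""
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [
        f"{owner}: {gb} GB, up to ~${gb * usd_per_gb_month:.2f}/month"
        for owner, gb in ranked
    ]


if __name__ == '__main__':
    # Made-up sample data for illustration only.
    sample = [
        {'SnapshotId': 'snap-aaa', 'VolumeSize': 500, 'Tags': [{'Key': 'team', 'Value': 'platform'}]},
        {'SnapshotId': 'snap-bbb', 'VolumeSize': 100, 'Tags': [{'Key': 'team', 'Value': 'payments'}]},
        {'SnapshotId': 'snap-ccc', 'VolumeSize': 200},
    ]
    for line in format_report(snapshot_gb_by_tag(sample)):
        print(line)
```

The "untagged" bucket is often the most useful line in the report, because it shows how much spend nobody currently owns.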

Choosing Your Battle

So where do you start? Here’s how I think about it.

| Solution | Effort | Immediate Impact | Long-Term Value |
|---|---|---|---|
| 1. The Triage Script | Low | High | Low (It's a recurring task) |
| 2. Automation & Guardrails | Medium | Medium | High (Solves the technical problem) |
| 3. Culture Shift (FinOps) | High | Low | Very High (Solves the root problem) |

My advice? Start with #1 today to make a dent. Immediately start working on #2, because that's the real engineering solution. And begin planting the seeds for #3 by talking to your manager and other team leads. True cloud cost optimization isn't about frantically deleting things once a month; it's about building a system and a culture where waste isn't created in the first place.

Darian Vance

Lead Cloud Architect & DevOps Strategist

With over 12 years in system architecture and automation, Darian specializes in simplifying complex cloud infrastructures. An advocate for open-source solutions, he founded TechResolve to provide engineers with actionable, battle-tested troubleshooting guides and robust software alternatives.


🤖 Frequently Asked Questions

❓ What causes excessive cloud costs from snapshots?

Excessive cloud costs from snapshots primarily stem from a fundamental lack of ownership and automation, leading to 'fire-and-forget' creation without lifecycle management or cleanup processes.

❓ How do automated snapshot management solutions compare to manual cleanup efforts?

Manual cleanup is a reactive, recurring chore with low long-term value. Automated solutions like AWS DLM and 'Reaper' Lambda functions provide a proactive, permanent fix by enforcing policies and automating creation and deletion, significantly reducing ongoing effort and cost.

❓ What is a common implementation pitfall when optimizing snapshot costs?

A common pitfall is accidentally deleting critical snapshots by running aggressive cleanup scripts without prior dry-run testing or careful validation of deletion candidates, which can lead to data loss. Always test scripts thoroughly.
