🚀 Executive Summary
TL;DR: Unattached Amazon EBS volumes are a common source of “slow burn” on cloud bills: because non-root data volumes are not deleted by default when their EC2 instance is terminated, they linger as orphans and keep accruing charges. The fix is a three-tiered approach: manual identification and deletion, enforcing `delete_on_termination = true` in Infrastructure as Code, and an automated Lambda function that cleans up orphan volumes.
🎯 Key Takeaways
- Additional EBS data volumes do not default to `delete_on_termination=true` when attached to an EC2 instance, leading to orphan volumes and unexpected costs after instance termination.
- Enforcing `delete_on_termination = true` for all relevant EBS block devices within Infrastructure as Code (e.g., Terraform) is the most effective preventive measure against phantom usage.
- An automated ‘janitor’ solution, such as an AWS Lambda function triggered by EventBridge, can periodically identify and delete ‘available’ EBS volumes older than a set threshold, acting as a crucial safety net.
Uncover the hidden costs of detached EBS volumes in your AWS bill. This guide, inspired by a real-world scenario, provides actionable steps for DevOps engineers to identify, eliminate, and prevent this “slow burn” of phantom cloud usage.
The “Slow Burn” Cloud Bill: How We Finally Tracked Down Phantom Usage
I remember the first time this happened to me. It was a Tuesday. Our finance lead, a wonderful but very direct woman named Carol, walked over to my desk, tablet in hand, pointing at a line item on our AWS bill. “Darian,” she said, “Why are we paying for 5TB of EBS storage we’re not using? The instance count is down, but this number keeps creeping up.” I waved it off initially. “Probably just snapshots, Carol, I’ll check it.” Famous last words. What followed was a multi-day goose chase that taught me a valuable lesson about the ghosts in the machine—the resources you pay for long after you think they’re gone.
The “Why”: The Ghost of Instances Past
Here’s the root of the problem, and it’s a classic “feature, not a bug” situation. When you spin up an Amazon EC2 instance, it comes with an Elastic Block Store (EBS) volume as its root drive. By default, AWS sets this root volume to “Delete on Termination.” Great. But any additional data volumes you attach? Their default setting is the opposite: they are NOT deleted when the instance is terminated.
Think about it. A junior engineer spins up `temp-data-proc-03` for a quick test, attaches a 500 GB gp3 volume, runs their job, and then terminates the instance. They assume they cleaned up. But that 500 GB volume is now sitting in your account, unattached to anything, quietly accruing charges month after month. It’s an orphan. Now, multiply that by a team of 20 engineers over two years. That’s how you get a slow burn that turns into a five-alarm fire on your bill.
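To make the slow burn concrete, here’s a back-of-the-envelope calculation. The per-GB rate below is an assumption (roughly the us-east-1 gp3 price at the time of writing; check current pricing for your region):

```python
GB_MONTH_RATE = 0.08  # assumed approx. us-east-1 gp3 rate ($/GB-month); verify for your region

# One forgotten 500 GB data volume:
orphan_gb = 500
per_month = orphan_gb * GB_MONTH_RATE
print(f"one orphan: ${per_month:.2f}/month")

# Twenty engineers each leaving one such volume per year, over two years:
fleet_gb = 500 * 20 * 2
fleet_per_month = fleet_gb * GB_MONTH_RATE
print(f"accumulated fleet: ${fleet_per_month:,.2f}/month")
```

One orphan is a rounding error; forty of them is a four-figure monthly line item that Carol will absolutely notice.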
The Fixes: From Firefighting to Fire Prevention
Okay, so you’ve confirmed you have a legion of phantom volumes. Let’s get this sorted. I approach this with a three-tiered strategy.
1. The Quick Fix: Manual Search and Destroy
First, you need to stop the bleeding. This is the manual, in-the-trenches work of finding and eliminating the orphan volumes right now. You can do this in the AWS Console by going to the EC2 dashboard, navigating to “Volumes,” and filtering by “State: Available.” But let’s be real, the CLI is faster.
Here’s the command I use to list all “available” (i.e., unattached) volumes in a specific region. It outputs a neat table with the Volume ID, Size, and Creation Date so you can see how old these ghosts are.
```shell
aws ec2 describe-volumes \
  --region us-east-1 \
  --filters Name=status,Values=available \
  --query "Volumes[*].{ID:VolumeId,Size:Size,CreateTime:CreateTime}" \
  --output table
```
Once you have the list, review it carefully (don’t just blindly delete everything!) and then use `aws ec2 delete-volume --volume-id vol-xxxxxxxxxxxxxxxxx` to remove the orphans one by one. It’s tedious, but it gives you an immediate win.
Pro Tip: Before you delete, double-check that no one needs the data. A good policy is to snapshot the volume first, tag the snapshot with a “pending-deletion” date, and then delete the volume itself. It’s cheap insurance against accidentally nuking something important.
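If you want to script that snapshot-first policy, here’s a minimal boto3 sketch. The `pending-deletion` tag key and the 30-day review window are conventions from this article, not AWS defaults; the client is passed in as a parameter purely to keep the workflow easy to test:

```python
import datetime

def snapshot_then_delete(ec2, volume_id, review_days=30):
    """Snapshot an orphan volume, tag the snapshot with a review-by
    date, then delete the volume itself.

    `ec2` is a boto3 EC2 client (or any stub exposing the same
    methods); injecting it keeps this function unit-testable."""
    snap = ec2.create_snapshot(
        VolumeId=volume_id,
        Description=f"pre-deletion backup of {volume_id}",
    )
    review_by = (
        datetime.date.today() + datetime.timedelta(days=review_days)
    ).isoformat()
    ec2.create_tags(
        Resources=[snap["SnapshotId"]],
        Tags=[{"Key": "pending-deletion", "Value": review_by}],
    )
    ec2.delete_volume(VolumeId=volume_id)
    return snap["SnapshotId"]
```

In a real run you’d pass `boto3.client("ec2", region_name="us-east-1")` as the `ec2` argument and feed it volume IDs from the table above.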
2. The Permanent Fix: Infrastructure as Code Discipline
Cleaning up is great, but preventing the mess is better. The real fix is to enforce the correct configuration when resources are created. If you’re using Infrastructure as Code (and you should be), this is straightforward.
When defining an EC2 instance, you can specify that any attached EBS volumes should be deleted when the instance is terminated. Here’s what that looks like in a Terraform block:
```hcl
resource "aws_instance" "web_server" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.micro"
  # ... other config ...

  ebs_block_device {
    device_name = "/dev/sdh"
    volume_size = 50

    # THIS IS THE MAGIC LINE
    delete_on_termination = true
  }
}
```
By setting `delete_on_termination = true`, you’re telling AWS, “When this instance dies, take its baggage with it.” This should be the default for 95% of use cases, especially in dev and test environments. Mandate this in your team’s pull request reviews and module standards. This is how you build a resilient, cost-effective system.
3. The ‘Nuclear’ Option: The Automated Janitor
Sometimes, despite your best efforts, things slip through. Manual console actions happen. Old, unmanaged accounts exist. For this, you need a janitor—an automated process that cleans up for you.
This is where I’ll set up a simple AWS Lambda function, written in Python or Go, that runs on a schedule (say, every night at 2 AM) via an Amazon EventBridge trigger. The function’s job is simple:
- Use the AWS SDK to call `describe-volumes` with the filter for “available” status.
- Loop through the list of orphan volumes.
- For any volume that has been “available” for more than a set period (e.g., 7 days), the Lambda can either:
  - The Soft Approach: add a tag like `autoclean_candidate: true`, letting you review before deletion.
  - The Hard Approach: just call `delete-volume` and get rid of it, logging the action to CloudWatch Logs for auditing.
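Here’s a minimal sketch of that janitor in Python. The 7-day threshold, the `DRY_RUN` flag, and the `autoclean_candidate` tag are assumptions matching the approaches above, not AWS conventions; the age check is kept in a pure function so it’s easy to unit-test without AWS credentials:

```python
import datetime

MAX_AGE_DAYS = 7   # how long a volume may sit "available" before cleanup
DRY_RUN = True     # soft approach: tag only; flip to False to actually delete

def stale_volume_ids(volumes, now, max_age_days=MAX_AGE_DAYS):
    """Return IDs of 'available' volumes created before the cutoff.

    `volumes` is the Volumes list from describe_volumes; `now` must be
    timezone-aware, since boto3 returns aware CreateTime values."""
    cutoff = now - datetime.timedelta(days=max_age_days)
    return [
        v["VolumeId"]
        for v in volumes
        if v["State"] == "available" and v["CreateTime"] < cutoff
    ]

def handler(event, context):
    """Entry point for the EventBridge-scheduled Lambda."""
    import boto3  # imported here so the pure logic above has no AWS dependency
    ec2 = boto3.client("ec2")
    now = datetime.datetime.now(datetime.timezone.utc)
    pages = ec2.get_paginator("describe_volumes").paginate(
        Filters=[{"Name": "status", "Values": ["available"]}]
    )
    for page in pages:
        for vol_id in stale_volume_ids(page["Volumes"], now):
            if DRY_RUN:
                # Soft approach: tag for human review.
                ec2.create_tags(
                    Resources=[vol_id],
                    Tags=[{"Key": "autoclean_candidate", "Value": "true"}],
                )
            else:
                # Hard approach: delete, and let CloudWatch Logs keep the audit trail.
                print(f"deleting {vol_id}")
                ec2.delete_volume(VolumeId=vol_id)
```

Wire `handler` to a nightly EventBridge rule, and give the function’s role only the `ec2:DescribeVolumes`, `ec2:CreateTags`, and `ec2:DeleteVolume` permissions it needs.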
Is this approach a bit heavy-handed? Yes. It’s a “hacky” solution to a discipline problem. But in large, chaotic environments, it’s an incredibly effective safety net that has saved my teams thousands of dollars. It ensures that even when people make mistakes, the financial damage is contained.
Ultimately, tackling the slow burn isn’t about one magic bullet. It’s about combining immediate action, long-term policy, and automated safeguards. Now go check your EBS volumes—Carol might be coming for you next.
🤖 Frequently Asked Questions
❓ How do I find and delete unattached EBS volumes in AWS?
You can find unattached EBS volumes by filtering for ‘State: Available’ in the AWS EC2 Console under ‘Volumes’, or more efficiently using the AWS CLI command: `aws ec2 describe-volumes --region us-east-1 --filters Name=status,Values=available --query "Volumes[*].{ID:VolumeId,Size:Size,CreateTime:CreateTime}" --output table`. Once identified, delete them using `aws ec2 delete-volume --volume-id vol-xxxxxxxxxxxxxxxxx`.
❓ What are the different strategies for managing EBS volume lifecycle and cost, and how do they compare?
The article outlines three strategies: manual ‘search and destroy’ for immediate cost reduction, Infrastructure as Code (IaC) with `delete_on_termination = true` for proactive prevention, and an ‘automated janitor’ Lambda function for continuous cleanup. Manual is reactive and labor-intensive, IaC is the most robust preventive measure, and automation acts as a crucial safety net for environments where IaC isn’t fully enforced or manual actions occur.
❓ What is a common pitfall when trying to prevent orphan EBS volumes, and how can it be avoided?
A common pitfall is assuming all EBS volumes will be deleted with an EC2 instance. By default, only the root volume has `delete_on_termination = true`. To avoid this, explicitly set `delete_on_termination = true` for all additional data volumes within your Infrastructure as Code definitions (e.g., Terraform `ebs_block_device` blocks) to ensure they are automatically removed upon instance termination.