Solved: Alert on Idle EC2 Instances with Low CPU Utilization

🚀 Executive Summary

TL;DR: Idle EC2 instances with low CPU utilization drive unnecessary cloud spend, and finding them by hand is slow and error-prone. This guide provides a Python script leveraging AWS CloudWatch and EC2 APIs to automatically detect and report these ‘zombie’ instances, enabling proactive cost optimization.

🎯 Key Takeaways

The solution utilizes `boto3` to programmatically interact with AWS, specifically `cloudwatch:GetMetricStatistics` for `CPUUtilization` and `ec2:DescribeInstances` to list running instances.
Secure handling of AWS credentials is achieved by loading `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_REGION` from a `config.env` file using `python-dotenv`.
Automated daily execution of the Python script is managed via `cron` jobs, ensuring continuous monitoring and reporting of idle EC2 instances without manual intervention.

Alert on Idle EC2 Instances with Low CPU Utilization

Hey team, Darian Vance here. I remember spending the first hour of every Monday manually combing through CloudWatch dashboards, trying to spot “zombie” EC2 instances. It was a tedious, coffee-fueled ritual. The worst part? I’d often miss a few, and they’d sit there racking up costs all week. This simple script I’m about to walk you through saved me that headache and cut our unnecessary dev-environment spend by about 15% last quarter. It’s a quick win, and I want to share how we set it up.

Prerequisites

An AWS account with programmatic access (Access Key ID and Secret Access Key).
An IAM user or role with permissions for cloudwatch:GetMetricStatistics and ec2:DescribeInstances.
Python 3 installed on the machine where you’ll run the script.
Familiarity with installing Python packages.

The Step-by-Step Guide

Step 1: Project Setup and Dependencies

Alright, let’s get our workspace ready. I’ll skip the standard virtual environment setup since you likely have your own workflow for that; the key is to isolate our dependencies. In your project directory, install two Python libraries with pip (`pip install boto3 python-dotenv`): boto3, the AWS SDK for Python, and python-dotenv, which loads our credentials from a file.

Next, create a file named config.env in your project root. This is where we’ll store our secrets instead of hardcoding them into the script, which is a much safer practice as long as the file stays out of version control.

Your config.env file should look like this:

AWS_ACCESS_KEY_ID=YOUR_AWS_ACCESS_KEY
AWS_SECRET_ACCESS_KEY=YOUR_AWS_SECRET_KEY
AWS_REGION=us-east-1
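boto3 reads `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` from the process environment, so once `load_dotenv` has run, the clients pick them up without any extra wiring. A quick sanity check before creating the clients can save some debugging. The sketch below is a minimal example; the helper name `missing_config` is made up for illustration and is not part of the original script.

```python
import os

# Variable names match the config.env shown above.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION")


def missing_config(env=None):
    """Return the required variable names that are unset or empty.

    `env` defaults to os.environ; passing a plain dict makes this easy to test.
    """
    if env is None:
        env = os.environ
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Called right after `load_dotenv('config.env')`, a non-empty result is a reasonable cue to print the missing names and exit before any AWS call is made.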

Step 2: The Python Script – Configuration and Finding Idle Instances

Now for the main event. Create a Python file, let’s call it find_idle_instances.py. The script will perform three main tasks: load our configuration, query CloudWatch for CPU metrics, and then print a report of any instances that fall below our idle threshold.

Here’s the complete script. I’ve added comments to explain what each part does.

import boto3
import os
from dotenv import load_dotenv
from datetime import datetime, timedelta, timezone

def find_low_cpu_instances():
    # --- Configuration ---
    load_dotenv('config.env')
    region = os.getenv("AWS_REGION")
    
    # Define the threshold for what we consider "idle"
    CPU_THRESHOLD = 5.0  # Percentage
    
    # Define the time window to check (e.g., the last 24 hours)
    # Timezone-aware UTC; datetime.utcnow() is deprecated as of Python 3.12
    end_time = datetime.now(timezone.utc)
    start_time = end_time - timedelta(days=1)

    # --- Initialize AWS Clients ---
    try:
        cloudwatch_client = boto3.client('cloudwatch', region_name=region)
        ec2_client = boto3.client('ec2', region_name=region)
    except Exception as e:
        print(f"Error creating boto3 clients: {e}")
        return

    idle_instances = []
    
    print("Fetching all running EC2 instances...")
    
    try:
        # Get all running instances first
        paginator = ec2_client.get_paginator('describe_instances')
        pages = paginator.paginate(Filters=[{'Name': 'instance-state-name', 'Values': ['running']}])
        
        all_running_instances = []
        for page in pages:
            for reservation in page['Reservations']:
                for instance in reservation['Instances']:
                    all_running_instances.append(instance)

        print(f"Found {len(all_running_instances)} running instances. Now checking CPU utilization...")

        # --- Check Metrics for Each Instance ---
        for instance in all_running_instances:
            instance_id = instance['InstanceId']
            
            # Get the 'Name' tag for better reporting
            instance_name = "N/A"
            if 'Tags' in instance:
                for tag in instance['Tags']:
                    if tag['Key'] == 'Name':
                        instance_name = tag['Value']
                        break
            
            response = cloudwatch_client.get_metric_statistics(
                Namespace='AWS/EC2',
                MetricName='CPUUtilization',
                Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
                StartTime=start_time,
                EndTime=end_time,
                Period=86400,  # 24 hours in seconds, gives one datapoint (the average)
                Statistics=['Average'],
                Unit='Percent'
            )
            
            # If there's data, check it against the threshold
            if response['Datapoints']:
                average_cpu = response['Datapoints'][0]['Average']
                if average_cpu < CPU_THRESHOLD:
                    idle_instances.append({
                        'InstanceId': instance_id,
                        'InstanceName': instance_name,
                        'AverageCPU': f"{average_cpu:.2f}%"
                    })
                    print(f"  - Found potential idle instance: {instance_name} ({instance_id})")

    except Exception as e:
        print(f"An error occurred during AWS API call: {e}")
        return

    # --- Report Findings ---
    print("\n--- Idle Instance Report ---")
    if not idle_instances:
        print("No idle instances found meeting the criteria.")
    else:
        for idle_instance in idle_instances:
            print(f"ID: {idle_instance['InstanceId']}, Name: {idle_instance['InstanceName']}, Avg CPU (24h): {idle_instance['AverageCPU']}")
    print("--------------------------")
    
if __name__ == '__main__':
    find_low_cpu_instances()

Pro Tip: In my production setups, I don’t just print the results. I modify the reporting section to post a formatted message to a specific Slack channel using a webhook. This makes the alert immediately visible to the whole team. You could also have it create a Jira ticket automatically. The `idle_instances` list is perfectly structured for that kind of integration.
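Here is a rough sketch of what that Slack integration could look like, using only the standard library. It assumes a Slack incoming webhook, which accepts a JSON body with a `text` field. The function names and the `SLACK_WEBHOOK_URL` environment variable are hypothetical choices for this example, not something the original script defines.

```python
import json
import os
import urllib.request


def build_slack_payload(idle_instances):
    """Turn the idle_instances list from the script into a Slack message body."""
    if not idle_instances:
        text = "No idle EC2 instances found."
    else:
        lines = [f"Idle EC2 instances ({len(idle_instances)}):"]
        for item in idle_instances:
            lines.append(
                f"- {item['InstanceName']} ({item['InstanceId']}): "
                f"{item['AverageCPU']} avg CPU over 24h"
            )
        text = "\n".join(lines)
    return {"text": text}


def post_to_slack(idle_instances, webhook_url=None):
    """POST the report to a Slack incoming webhook and return the HTTP status."""
    # SLACK_WEBHOOK_URL is an assumed variable name; store it alongside the
    # other secrets in config.env if you adopt this.
    webhook_url = webhook_url or os.getenv("SLACK_WEBHOOK_URL")
    if not webhook_url:
        raise ValueError("No Slack webhook URL configured")
    data = json.dumps(build_slack_payload(idle_instances)).encode("utf-8")
    request = urllib.request.Request(
        webhook_url,
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Keeping the formatting separate from the HTTP call means the message can be checked without touching the network, and the same payload builder could feed a different destination later.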

Step 3: Scheduling with Cron

A script like this is most valuable when it runs automatically. On a Linux-based system, a cron job is the perfect tool for this. You’ll want to edit your user’s crontab to add a new job.

To have the script run every day at 2 AM, you would add the following line. This example assumes your script is in your home directory’s `scripts` folder.

0 2 * * * cd "$HOME/scripts" && python3 find_idle_instances.py

This simple line ensures you get a fresh report every morning without lifting a finger.
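One thing to watch with scheduled runs: the script calls `load_dotenv('config.env')` with a relative path, so it only finds the file when the working directory is the project folder. Cron typically starts jobs in the user’s home directory, so one option is to resolve the path relative to the script itself. This is a small sketch; `config_path_for` is a made-up helper name.

```python
from pathlib import Path


def config_path_for(script_file, filename="config.env"):
    """Return the path to `filename` in the same directory as `script_file`."""
    return Path(script_file).resolve().parent / filename
```

Inside the script this would be used as `load_dotenv(config_path_for(__file__))`, which behaves the same whether the script is launched by cron, from another directory, or by hand.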

Common Pitfalls (Where I Usually Mess Up)

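IAM Permissions: The number one issue is an access-denied error, which boto3 raises as a `ClientError` (typically with the code `AccessDenied` from CloudWatch or `UnauthorizedOperation` from EC2). I’ve spent way too much time debugging only to realize my IAM role was missing either cloudwatch:GetMetricStatistics or ec2:DescribeInstances. Always check the policy first.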
Ignoring Burstable Instances: Be careful with T-series instances (like t2, t3, t4g). They are designed to have low baseline CPU and “burst” when needed. An average of 2% might be perfectly normal for them. In more advanced versions of my script, I add logic to check the instance type and either ignore T-series or apply a much lower threshold (like 1%) to them.
API Rate Limiting: If you have thousands of instances, calling `get_metric_statistics` in a tight loop could get you throttled by the AWS API. The script above is fine for a few hundred instances, but for a massive fleet, I’d recommend using CloudWatch’s `GetMetricData` API, which can query up to 500 metrics in a single call. It’s more complex but much more efficient at scale.
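The last two pitfalls can be sketched in code. The example below is illustrative rather than the exact logic from my own scripts: `threshold_for` applies the lower 1% threshold to T-series types (read from the `InstanceType` field that `describe_instances` returns), and `fetch_average_cpu` batches instance IDs into `GetMetricData` calls of at most 500 queries each. All function and constant names here are assumptions made for the example, and switching to this API also requires the cloudwatch:GetMetricData permission.

```python
MAX_QUERIES_PER_CALL = 500   # GetMetricData limit mentioned above
DEFAULT_THRESHOLD = 5.0      # percent, same as CPU_THRESHOLD in the script
BURSTABLE_THRESHOLD = 1.0    # percent, the lower T-series value suggested above


def threshold_for(instance_type):
    """Pick an idle threshold based on the instance family."""
    family = instance_type.split(".")[0]  # "t3.micro" -> "t3"
    # T-series families are "t" followed by a digit (t2, t3, t3a, t4g).
    if family.startswith("t") and family[1:2].isdigit():
        return BURSTABLE_THRESHOLD
    return DEFAULT_THRESHOLD


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_cpu_queries(instance_ids):
    """Build one MetricDataQuery per instance for 24h average CPUUtilization."""
    return [
        {
            "Id": f"cpu{index}",  # must start with a lowercase letter
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/EC2",
                    "MetricName": "CPUUtilization",
                    "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
                },
                "Period": 86400,
                "Stat": "Average",
            },
            "ReturnData": True,
        }
        for index, instance_id in enumerate(instance_ids)
    ]


def fetch_average_cpu(cloudwatch_client, instance_ids, start_time, end_time):
    """Return {instance_id: average_cpu} for instances that have data."""
    averages = {}
    for batch in chunked(instance_ids, MAX_QUERIES_PER_CALL):
        queries = build_cpu_queries(batch)
        id_to_instance = {q["Id"]: iid for q, iid in zip(queries, batch)}
        kwargs = {
            "MetricDataQueries": queries,
            "StartTime": start_time,
            "EndTime": end_time,
        }
        while True:
            response = cloudwatch_client.get_metric_data(**kwargs)
            for result in response["MetricDataResults"]:
                # One value is expected for a 24h period over a 24h window;
                # like the original script, take the first one returned.
                if result["Values"]:
                    averages[id_to_instance[result["Id"]]] = result["Values"][0]
            token = response.get("NextToken")
            if not token:
                break
            kwargs["NextToken"] = token
    return averages
```

In the main loop, the comparison would then become something like `averages[instance_id] < threshold_for(instance['InstanceType'])`, with instances missing from `averages` treated as having no data.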

Conclusion

And that’s really all there is to it. This is a foundational script that provides immediate value by helping you spot waste. You can build on it to check for low network I/O, unattached EBS volumes, or even automatically tag instances for review. It’s a small investment of time that pays off quickly. Hope this helps you reclaim some of your own time.
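As a starting point for one of those extensions, here is a hedged sketch of an unattached-EBS-volume check. It relies on the `describe_volumes` paginator with a `status` filter of `available`, which is how EC2 reports volumes not attached to any instance. The function name is hypothetical, and the IAM policy would also need ec2:DescribeVolumes.

```python
def find_unattached_volumes(ec2_client):
    """Return ID and size for every EBS volume in the 'available' state."""
    paginator = ec2_client.get_paginator("describe_volumes")
    pages = paginator.paginate(
        Filters=[{"Name": "status", "Values": ["available"]}]
    )
    return [
        {"VolumeId": volume["VolumeId"], "SizeGiB": volume["Size"]}
        for page in pages
        for volume in page["Volumes"]
    ]
```

It takes the same `ec2_client` the script already creates, so the result can be appended to the report alongside the idle instances.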

– Darian

Darian Vance

Lead Cloud Architect & DevOps Strategist

With over 12 years in system architecture and automation, Darian specializes in simplifying complex cloud infrastructures. An advocate for open-source solutions, he founded TechResolve to provide engineers with actionable, battle-tested troubleshooting guides and robust software alternatives.

🤖 Frequently Asked Questions

❓ How can I automatically detect idle EC2 instances to reduce costs?

You can implement a Python script using `boto3` to query AWS CloudWatch for `CPUUtilization` metrics and EC2 `describe_instances` to identify running instances. The script then reports instances whose average CPU utilization falls below a defined `CPU_THRESHOLD` over a specified time window, typically scheduled with `cron`.

❓ How does this compare to alternatives for identifying idle resources?

This custom script offers a highly specific and configurable method for CPU-based idle EC2 detection, providing more granular control than general AWS Cost Explorer recommendations. While it requires manual setup, it avoids the overhead and broader scope of third-party cloud cost management tools, focusing purely on a defined ‘idle’ state.

❓ What are common implementation pitfalls when setting up idle EC2 instance alerts?

Common pitfalls include access-denied errors due to insufficient IAM permissions (e.g., missing `cloudwatch:GetMetricStatistics` or `ec2:DescribeInstances`), misinterpreting low CPU for burstable T-series instances which have naturally low baselines, and encountering AWS API rate limiting when processing a very large number of instances, for which `GetMetricData` is a more efficient alternative.

TechResolve – SaaS Troubleshooting & Software Alternatives
