Solved: Alert on High IOPS usage on AWS EBS Volumes

🚀 Executive Summary

TL;DR: This article addresses the challenge of identifying and alerting on high IOPS usage on AWS EBS volumes, which can lead to application performance degradation. It provides a Python script that leverages AWS CloudWatch to monitor EBS VolumeReadOps and VolumeWriteOps and sends proactive alerts via SNS when a configured IOPS threshold is exceeded.

🎯 Key Takeaways

IAM policies for EBS IOPS monitoring must adhere to the principle of least privilege, granting specific actions like ec2:DescribeVolumes, cloudwatch:GetMetricData, and sns:Publish.
The Python script utilizes Boto3 to query AWS/EBS VolumeReadOps and VolumeWriteOps metrics from CloudWatch, calculating average IOPS over a defined CHECK_PERIOD_MINUTES for ‘in-use’ volumes.
Proactive alerting is achieved by comparing the calculated average IOPS against a configurable IOPS_THRESHOLD and publishing detailed notifications to an AWS SNS Topic.
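
To make the averaging concrete, here is a minimal sketch of the arithmetic involved. The function name `average_iops` and the sample operation counts are illustrative assumptions, not part of the script itself.

```python
def average_iops(read_ops: float, write_ops: float, period_minutes: int = 5) -> float:
    """Average IOPS = total operations in the window / seconds in the window."""
    period_seconds = period_minutes * 60
    return (read_ops + write_ops) / period_seconds


# Hypothetical sample: 270,000 reads and 210,000 writes in a 5-minute window.
iops = average_iops(270_000, 210_000)
print(f"{iops:.2f}")  # 480000 / 300 = 1600.00
```

With an `IOPS_THRESHOLD` of 1500, this sample window would trigger an alert. Note that this is an average: a short burst well above the threshold can still average out below it.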

Alert on High IOPS usage on AWS EBS Volumes

Hey there, Darian Vance here. As a Senior DevOps Engineer at TechResolve, I’ve seen my fair share of production fires. One of the sneakiest culprits? A runaway process hammering an EBS volume, grinding an application to a halt. For a while, I was that person digging through CloudWatch dashboards *after* an incident. I probably wasted hours a week just manually checking metrics. That’s why I automated it. This simple, proactive alert setup has saved my team countless headaches by catching performance bottlenecks before they impact users. Let’s get this set up so you can reclaim some of that time, too.

Prerequisites

Before we dive in, make sure you have the following ready to go:

An AWS account with IAM permissions to create policies and users.
Python 3 installed on the machine where you’ll run the script.
An AWS SNS Topic created. This is where our alerts will be sent. Make sure you have its ARN handy.
Your AWS credentials configured (e.g., via environment variables or an IAM role).

The Guide: Step-by-Step

Step 1: IAM Permissions – The Gatekeeper

First things first, we need to give our script permission to talk to AWS. I always advocate for the principle of least privilege. Our script only needs to do three things: describe EBS volumes, get metrics from CloudWatch, and publish to SNS. Nothing more.

Create an IAM policy with the following JSON. I’ve named mine `EBS-IOPS-Monitor-Policy`.


{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowVolumeAndMetricAccess",
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeVolumes",
                "cloudwatch:GetMetricData"
            ],
            "Resource": "*"
        },
        {
            "Sid": "AllowSnsPublish",
            "Effect": "Allow",
            "Action": "sns:Publish",
            "Resource": "arn:aws:sns:YOUR_REGION:YOUR_ACCOUNT_ID:YOUR_SNS_TOPIC_NAME"
        }
    ]
}

Remember to replace the `Resource` ARN in the `AllowSnsPublish` statement with your actual SNS Topic ARN. Attach this policy to the IAM user or role that will execute the script.
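
If you prefer to generate the policy rather than hand-edit the placeholder, something like the sketch below could work. The helper name `build_monitor_policy` is hypothetical, and the ARN check is only a basic sanity test (six colon-separated fields with `sns` as the service), not full validation.

```python
import json


def build_monitor_policy(topic_arn: str) -> str:
    """Return the monitoring policy as a JSON string for the given SNS topic ARN."""
    parts = topic_arn.split(":")
    # An SNS topic ARN looks like arn:PARTITION:sns:REGION:ACCOUNT_ID:TOPIC_NAME.
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "sns":
        raise ValueError(f"Not an SNS topic ARN: {topic_arn!r}")
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowVolumeAndMetricAccess",
                "Effect": "Allow",
                "Action": ["ec2:DescribeVolumes", "cloudwatch:GetMetricData"],
                "Resource": "*",
            },
            {
                "Sid": "AllowSnsPublish",
                "Effect": "Allow",
                "Action": "sns:Publish",
                "Resource": topic_arn,
            },
        ],
    }
    return json.dumps(policy, indent=4)


# Example ARN only; substitute your own topic.
print(build_monitor_policy("arn:aws:sns:us-east-1:123456789012:MySnsAlertsTopic"))
```

The printed JSON can be pasted into the IAM console or saved to a file for your usual deployment tooling.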

Step 2: The Python Script – The Brains of the Operation

Alright, let’s get to the core logic. I’ll skip the standard virtualenv setup since you likely have your own workflow for that. Just make sure you have the Boto3 library installed. You can typically do this by running `pip install boto3` in your terminal.

Here’s the full script. I’ve added comments to explain what each part is doing.


import os
import boto3
from datetime import datetime, timedelta, timezone

# --- Configuration ---
# I prefer using environment variables. It's cleaner than hardcoding.
# You can set these in your terminal or a config.env file.
AWS_REGION = os.getenv('AWS_REGION', 'us-east-1')
SNS_TOPIC_ARN = os.getenv('SNS_TOPIC_ARN')
IOPS_THRESHOLD = int(os.getenv('IOPS_THRESHOLD', '1000')) # Default to 1000 IOPS
CHECK_PERIOD_MINUTES = 5 # Check metrics from the last 5 minutes.

# --- Initialize AWS Clients ---
# Using the specified region from our config.
ec2_client = boto3.client('ec2', region_name=AWS_REGION)
cloudwatch_client = boto3.client('cloudwatch', region_name=AWS_REGION)
sns_client = boto3.client('sns', region_name=AWS_REGION)

def get_total_iops(volume_id):
    """Queries CloudWatch and returns the average combined (read + write) IOPS for a given volume."""
    # Use a timezone-aware UTC timestamp; datetime.utcnow() is deprecated as of Python 3.12.
    end_time = datetime.now(timezone.utc)
    start_time = end_time - timedelta(minutes=CHECK_PERIOD_MINUTES)
    
    # CloudWatch needs a specific query format. We are getting the sum of all
    # operations over our check period.
    response = cloudwatch_client.get_metric_data(
        MetricDataQueries=[
            {
                'Id': 'read_ops',
                'MetricStat': {
                    'Metric': {
                        'Namespace': 'AWS/EBS',
                        'MetricName': 'VolumeReadOps',
                        'Dimensions': [{'Name': 'VolumeId', 'Value': volume_id}]
                    },
                    'Period': CHECK_PERIOD_MINUTES * 60,
                    'Stat': 'Sum',
                },
                'ReturnData': True,
            },
            {
                'Id': 'write_ops',
                'MetricStat': {
                    'Metric': {
                        'Namespace': 'AWS/EBS',
                        'MetricName': 'VolumeWriteOps',
                        'Dimensions': [{'Name': 'VolumeId', 'Value': volume_id}]
                    },
                    'Period': CHECK_PERIOD_MINUTES * 60,
                    'Stat': 'Sum',
                },
                'ReturnData': True,
            },
        ],
        StartTime=start_time,
        EndTime=end_time
    )

    # Extract the values from the response. Match results by query Id rather than
    # by list position, and treat a metric with no datapoints as 0 operations.
    results = {r['Id']: r['Values'] for r in response['MetricDataResults']}

    total_read_ops = sum(results.get('read_ops', []))
    total_write_ops = sum(results.get('write_ops', []))
    
    # Calculate average IOPS over the period
    total_ops = total_read_ops + total_write_ops
    average_iops = total_ops / (CHECK_PERIOD_MINUTES * 60)
    
    return average_iops

def send_sns_alert(volume_id, iops_value):
    """Formats and sends an alert to the configured SNS topic."""
    subject = f"High IOPS Alert for EBS Volume: {volume_id}"
    message = (
        f"Alert: High IOPS detected on EBS Volume.\n\n"
        f"Volume ID: {volume_id}\n"
        f"Detected Average IOPS: {iops_value:.2f}\n"
        f"Threshold: {IOPS_THRESHOLD}\n"
        f"Region: {AWS_REGION}\n\n"
        f"Please investigate the EC2 instance attached to this volume."
    )
    
    print(f"Sending alert for volume {volume_id}...")
    sns_client.publish(
        TopicArn=SNS_TOPIC_ARN,
        Subject=subject,
        Message=message
    )

def main():
    """Main function to iterate through volumes and check their IOPS."""
    if not SNS_TOPIC_ARN:
        print("Error: SNS_TOPIC_ARN environment variable is not set. Cannot send alerts.")
        return

    print("Starting EBS IOPS check...")
    # I'm only checking 'in-use' volumes to be efficient. No point checking detached ones.
    paginator = ec2_client.get_paginator('describe_volumes')
    pages = paginator.paginate(Filters=[{'Name': 'status', 'Values': ['in-use']}])

    for page in pages:
        for volume in page['Volumes']:
            volume_id = volume['VolumeId']
            print(f"Checking volume: {volume_id}")
            
            try:
                avg_iops = get_total_iops(volume_id)
                if avg_iops > IOPS_THRESHOLD:
                    print(f"ALERT! Volume {volume_id} has high IOPS: {avg_iops:.2f}")
                    send_sns_alert(volume_id, avg_iops)
                else:
                    print(f"Volume {volume_id} is OK (IOPS: {avg_iops:.2f})")
            except Exception as e:
                print(f"Could not process volume {volume_id}. Error: {e}")
                
    print("EBS IOPS check complete.")

if __name__ == "__main__":
    main()

Pro Tip: Notice I used `paginator` for `describe_volumes`. If you have hundreds or thousands of EBS volumes in your account, a simple `describe_volumes` call might time out or miss volumes. The paginator handles the token-based pagination for you automatically. It’s a robust way to handle large environments.
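
For the curious, here is roughly what token-based pagination looks like when written by hand. This is a sketch for illustration, not what the script uses: the function name `iter_in_use_volume_ids` and the page size of 500 are my own choices, and the client is passed in so the function can be exercised without AWS access.

```python
def iter_in_use_volume_ids(ec2_client, page_size=500):
    """Yield the ID of every in-use volume, following NextToken by hand."""
    kwargs = {
        "Filters": [{"Name": "status", "Values": ["in-use"]}],
        "MaxResults": page_size,  # ask for results one page at a time
    }
    while True:
        page = ec2_client.describe_volumes(**kwargs)
        for volume in page["Volumes"]:
            yield volume["VolumeId"]
        token = page.get("NextToken")
        if not token:
            break  # no token means this was the last page
        kwargs["NextToken"] = token


# Usage (requires AWS credentials):
#   import boto3
#   for volume_id in iter_in_use_volume_ids(boto3.client("ec2")):
#       print(volume_id)
```

The paginator saves you from maintaining this loop yourself, which is why the script relies on it.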

Step 3: Configuration and Environment

The script pulls its configuration from environment variables. This is great for security and flexibility. Create a file named `config.env` in the same directory as your script. Do not commit this file to version control.

Your `config.env` file should look like this:


export AWS_REGION="us-east-1"
export SNS_TOPIC_ARN="arn:aws:sns:us-east-1:123456789012:MySnsAlertsTopic"
export IOPS_THRESHOLD="1500"

Before you run the script, you’ll need to load these variables. You can do this by running `source config.env` in your terminal session.
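
One thing to keep in mind: the script calls `int()` on `IOPS_THRESHOLD` as soon as it starts, so a typo in `config.env` surfaces as a bare traceback. If you want friendlier failures, a validation step along these lines might help. The function name `load_config` and the returned dictionary keys are hypothetical and not part of the script above.

```python
import os


def load_config(env=os.environ):
    """Read and validate the monitor's settings from a mapping of environment variables."""
    topic_arn = env.get("SNS_TOPIC_ARN")
    if not topic_arn:
        raise ValueError("SNS_TOPIC_ARN is not set")

    raw_threshold = env.get("IOPS_THRESHOLD", "1000")
    try:
        threshold = int(raw_threshold)
    except ValueError:
        raise ValueError(
            f"IOPS_THRESHOLD must be an integer, got {raw_threshold!r}"
        ) from None
    if threshold <= 0:
        raise ValueError("IOPS_THRESHOLD must be positive")

    return {
        "region": env.get("AWS_REGION", "us-east-1"),
        "topic_arn": topic_arn,
        "threshold": threshold,
    }


# Example with an explicit mapping instead of the real environment.
sample = {
    "SNS_TOPIC_ARN": "arn:aws:sns:us-east-1:123456789012:MySnsAlertsTopic",
    "IOPS_THRESHOLD": "1500",
}
print(load_config(sample))
```

Passing the environment in as a mapping keeps the function easy to test without touching your real shell variables.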

Step 4: Scheduling the Script

This script is only useful if it runs automatically. In my production setups, I often use a simple cron job for tasks like this. You could also set this up as an AWS Lambda function for a more “cloud-native” approach, but cron is perfectly fine.

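The script only looks at the last `CHECK_PERIOD_MINUTES` (5 minutes), so schedule it at the same interval. An hourly run would leave 55 minutes of every hour unchecked. Cron also doesn’t load your shell session’s variables, so load `config.env` in the entry itself and use absolute paths (the paths below are placeholders):


*/5 * * * * . /path/to/config.env && python3 /path/to/your_script_name.py

This tells the system to execute your Python script every five minutes. If you change the schedule, change `CHECK_PERIOD_MINUTES` to match so that no window goes unmonitored.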

Common Pitfalls (Where I Usually Mess Up)

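IAM Permissions: The classic access-denied error. Boto3 raises it as a `ClientError`, and the exact error code varies by service (EC2, for example, reports `UnauthorizedOperation`). If you see this, double-check that your IAM policy is correct and attached to the right user or role. Also, make sure the SNS Topic ARN in the policy matches the one you’re trying to publish to.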
No CloudWatch Data: Sometimes CloudWatch doesn’t have data for the requested period, especially for newly created volumes or volumes with zero activity. The script handles this by defaulting to 0, but it’s something to be aware of if you’re not seeing expected results.
Throttling: The script makes one `get_metric_data` call per volume, so if you have thousands of volumes, you might hit AWS API rate limits. The `paginator` only covers the `describe_volumes` listing; for very large-scale checks, I’ve sometimes had to add a small `time.sleep(0.1)` inside the main loop to slow things down and be a good citizen.
Setting the Threshold Too Low: Be realistic with your `IOPS_THRESHOLD`. Set it too low, and you’ll get spammed with alerts for normal workload spikes. I recommend monitoring your key volumes for a day to understand their baseline before setting an aggressive threshold.
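
On the throttling point, another option is to make fewer calls in the first place. `GetMetricData` accepts up to 500 metric queries per request, so many volumes can share one call. Below is a minimal sketch of building those batched queries; the helper names `chunk` and `build_queries` and the short query Ids (`r0`, `w0`, ...) are my own assumptions. The short Ids exist because a query `Id` must start with a lowercase letter and may only contain letters, digits, and underscores, which rules out using a volume ID directly.

```python
MAX_QUERIES_PER_CALL = 500  # GetMetricData limit on MetricDataQueries per request


def chunk(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_queries(volume_ids, period_seconds=300):
    """Return (queries, id_map); id_map maps each query Id to (volume_id, metric)."""
    queries, id_map = [], {}
    for index, volume_id in enumerate(volume_ids):
        for prefix, metric in (("r", "VolumeReadOps"), ("w", "VolumeWriteOps")):
            query_id = f"{prefix}{index}"
            id_map[query_id] = (volume_id, metric)
            queries.append({
                "Id": query_id,
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/EBS",
                        "MetricName": metric,
                        "Dimensions": [{"Name": "VolumeId", "Value": volume_id}],
                    },
                    "Period": period_seconds,
                    "Stat": "Sum",
                },
                "ReturnData": True,
            })
    return queries, id_map


# Made-up volume IDs; two queries per volume means 250 volumes per call.
volume_ids = [f"vol-{n:04d}" for n in range(600)]
batches = [build_queries(group) for group in chunk(volume_ids, MAX_QUERIES_PER_CALL // 2)]
print(len(batches))  # 600 volumes at 250 per call -> 3
```

Each `queries` list would then be passed as `MetricDataQueries` to `get_metric_data`, with `id_map` used to match every result's `Id` back to its volume. Large responses may themselves be paginated, so check for a `NextToken` in the response as well.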

Conclusion

And that’s it! You now have a robust, automated system for monitoring EBS IOPS. This isn’t just about getting an alert; it’s about shifting from a reactive to a proactive mindset. By catching these issues early, you’re not just preventing downtime—you’re building a more reliable and performant system. Now, go enjoy that extra time you just saved.

Darian Vance

Lead Cloud Architect & DevOps Strategist

With over 12 years in system architecture and automation, Darian specializes in simplifying complex cloud infrastructures. An advocate for open-source solutions, he founded TechResolve to provide engineers with actionable, battle-tested troubleshooting guides and robust software alternatives.

🤖 Frequently Asked Questions

❓ How can I automatically alert on high IOPS usage for AWS EBS volumes?

You can use a Python script with Boto3 to periodically query CloudWatch for VolumeReadOps and VolumeWriteOps metrics across all ‘in-use’ EBS volumes. If the calculated average IOPS exceeds a predefined IOPS_THRESHOLD, an alert is published to an AWS SNS Topic.

❓ How does this custom script approach compare to using native CloudWatch Alarms?

While CloudWatch Alarms can monitor individual metrics, this custom script provides more flexibility for dynamic environments by iterating through all ‘in-use’ volumes and applying a single logic. It also allows for custom aggregation (sum of read/write ops) and a centralized alerting mechanism for multiple volumes without creating an alarm per volume.

❓ What are common issues to watch out for when implementing this EBS IOPS monitoring solution?

Common pitfalls include access-denied errors due to insufficient IAM permissions, lack of CloudWatch data for new or inactive volumes, potential AWS API throttling with many volumes (mitigated by adding a short time.sleep between calls), and setting an IOPS_THRESHOLD too low, leading to excessive alerts.

