Solved: Auto-Scale AWS ASG based on Custom SQS Queue Length

🚀 Executive Summary

TL;DR: The article addresses the challenge of efficiently scaling AWS Auto Scaling Groups (ASGs) for SQS job processing by eliminating manual intervention and over-provisioning. It details a solution to automatically scale worker nodes based on the actual SQS queue length, ensuring cost-effectiveness and system resilience.

🎯 Key Takeaways

A Python Lambda function, triggered by EventBridge (CloudWatch Events) on a schedule (e.g., `rate(1 minute)`), is used to retrieve the `ApproximateNumberOfMessages` from an SQS queue and publish it as a custom `SQSQueueDepth` metric in CloudWatch under the `CustomSQSMetrics` namespace.
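The Lambda function requires an IAM role with `sqs:GetQueueUrl` and `sqs:GetQueueAttributes` for the specific SQS queue, `cloudwatch:PutMetricData` for publishing metrics, and only the `logs:` actions needed for logging (`logs:CreateLogGroup`, `logs:CreateLogStream`, `logs:PutLogEvents`), adhering to the principle of least privilege.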
CloudWatch alarms are configured on the custom `SQSQueueDepth` metric (e.g., ‘Greater/Equal 100’ for scale-up, ‘Less/Equal 10’ for scale-down) and linked to ASG simple scaling policies with specific cooldown periods (e.g., 300s for scale-up, 600s for scale-down) and multiple evaluation periods (e.g., ‘2 out of 3’ for up, ‘5 out of 5’ for down) to prevent ‘alarm flapping’.

Auto-Scale AWS ASG based on Custom SQS Queue Length

Hey team, Darian Vance here. Let’s talk about a classic DevOps headache: scaling worker nodes that process SQS jobs. For a long time, I was either over-provisioning instances “just in case” of a spike, or worse, manually checking logs and queue depths to scale up during a busy period. It was a huge time sink. Setting up this automated flow saved me hours a week and made our system far more resilient and cost-effective. Let’s build it.

Prerequisites

Before we dive in, make sure you have the following ready to go:

An AWS account with permissions to manage IAM, Lambda, CloudWatch, SQS, and Auto Scaling Groups.
An existing SQS Standard Queue that your workers process.
An existing Auto Scaling Group (ASG) that you want to scale.
Basic comfort with Python and the AWS Management Console.

The Step-by-Step Guide

Step 1: The IAM Role – Giving Our Lambda Permissions

First things first, our Lambda function needs permission to talk to other AWS services. It needs to look up the queue and read its attributes in SQS, and write metrics to CloudWatch. The principle of least privilege is key here.

1. Navigate to the IAM service in your AWS Console.
2. Go to **Roles** and click **Create role**.
3. Select **AWS service** as the trusted entity type, and choose **Lambda** as the use case.
4. On the permissions page, click **Create policy**. This will open a new tab.
5. In the JSON editor, paste the following policy. This grants exactly what we need and nothing more. Remember to replace `YOUR_AWS_REGION`, `YOUR_AWS_ACCOUNT_ID`, and `YOUR_QUEUE_NAME` with your specific values.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowSQSRead",
            "Effect": "Allow",
            "Action": [
                "sqs:GetQueueUrl",
                "sqs:GetQueueAttributes"
            ],
            "Resource": "arn:aws:sqs:YOUR_AWS_REGION:YOUR_AWS_ACCOUNT_ID:YOUR_QUEUE_NAME"
        },
        {
            "Sid": "AllowCloudWatchWrite",
            "Effect": "Allow",
            "Action": "cloudwatch:PutMetricData",
            "Resource": "*"
        },
        {
            "Sid": "AllowLogging",
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "arn:aws:logs:*:*:*"
        }
    ]
}

6. Name the policy something descriptive, like `SQS-to-CloudWatch-Metric-Policy`, and create it.
7. Go back to the role creation tab, refresh the policy list, and attach the policy you just created.
8. Name the role something like `SQSMetricPublisherRole` and finish creating it.
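
If you would rather generate the policy than hand-edit placeholders, here is a minimal sketch that builds the same document from three values. The function name `build_metric_publisher_policy` and the sample region, account ID, and queue name are my own placeholders, not anything AWS defines. It also lists `sqs:GetQueueUrl`, since the Lambda in this guide resolves the queue URL from its name.

```python
import json


def build_metric_publisher_policy(region: str, account_id: str, queue_name: str) -> dict:
    """Return a least-privilege policy document for the metric-publishing Lambda."""
    queue_arn = f"arn:aws:sqs:{region}:{account_id}:{queue_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowSQSRead",
                "Effect": "Allow",
                # GetQueueUrl is needed because the Lambda looks the queue up by name.
                "Action": ["sqs:GetQueueUrl", "sqs:GetQueueAttributes"],
                "Resource": queue_arn,
            },
            {
                "Sid": "AllowCloudWatchWrite",
                "Effect": "Allow",
                "Action": "cloudwatch:PutMetricData",
                "Resource": "*",
            },
            {
                "Sid": "AllowLogging",
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents",
                ],
                "Resource": "arn:aws:logs:*:*:*",
            },
        ],
    }


if __name__ == "__main__":
    # Placeholder values; substitute your own region, account ID, and queue name.
    policy = build_metric_publisher_policy("us-east-1", "123456789012", "my-worker-queue")
    print(json.dumps(policy, indent=4))
```

The printed JSON can be pasted into the console editor, or passed as `PolicyDocument=json.dumps(policy)` to boto3's `iam.create_policy` if you script the setup.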

Step 2: The Python Lambda – Our Metric Publisher

This is the core of our solution. This function will run on a schedule, check the SQS queue for the number of messages waiting, and publish that number as a custom metric in CloudWatch.

I’ll skip the standard virtualenv setup since you likely have your own workflow for that. The good news is that the `boto3` library is already available in the standard AWS Lambda Python runtime, so we don’t need to package any dependencies for this simple script.

Here’s the Python code:

import os
import boto3

# Initialize clients
sqs = boto3.client('sqs')
cloudwatch = boto3.client('cloudwatch')

def lambda_handler(event, context):
    # Get the queue name from an environment variable for flexibility
    queue_name = os.environ.get('QUEUE_NAME')
    if not queue_name:
        print("Error: QUEUE_NAME environment variable not set.")
        return { 'statusCode': 500, 'body': 'QUEUE_NAME not set' }

    try:
        # Get the queue URL from its name
        queue_url_response = sqs.get_queue_url(QueueName=queue_name)
        queue_url = queue_url_response['QueueUrl']
        
        # Get the approximate number of messages
        attributes_response = sqs.get_queue_attributes(
            QueueUrl=queue_url,
            AttributeNames=['ApproximateNumberOfMessages']
        )
        
        message_count = int(attributes_response['Attributes']['ApproximateNumberOfMessages'])
        print(f"Queue: {queue_name}, Messages: {message_count}")

        # Publish the custom metric to CloudWatch
        cloudwatch.put_metric_data(
            Namespace='CustomSQSMetrics',
            MetricData=[
                {
                    'MetricName': 'SQSQueueDepth',
                    'Dimensions': [
                        {
                            'Name': 'QueueName',
                            'Value': queue_name
                        },
                    ],
                    'Value': message_count,
                    'Unit': 'Count'
                },
            ]
        )
        
        return { 'statusCode': 200, 'body': f'Successfully published metric: {message_count}' }

    except Exception as e:
        print(f"An error occurred: {e}")
        # The error is logged and returned rather than raised, so the invocation
        # itself still reports success; check the function's logs for failures.
        return { 'statusCode': 500, 'body': str(e) }
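
Before deploying, it can help to exercise the logic without touching AWS. The sketch below is one way to do that, assuming you are willing to pull the core of the handler into a function that takes its clients as arguments. `publish_queue_depth`, `FakeSQS`, and `FakeCloudWatch` are hypothetical names I made up for illustration; the fakes only mimic the three calls the handler uses.

```python
def publish_queue_depth(sqs, cloudwatch, queue_name: str) -> int:
    """Read the queue depth via `sqs`, publish it via `cloudwatch`, return the depth."""
    queue_url = sqs.get_queue_url(QueueName=queue_name)["QueueUrl"]
    attrs = sqs.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    depth = int(attrs["Attributes"]["ApproximateNumberOfMessages"])
    cloudwatch.put_metric_data(
        Namespace="CustomSQSMetrics",
        MetricData=[{
            "MetricName": "SQSQueueDepth",
            "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
            "Value": depth,
            "Unit": "Count",
        }],
    )
    return depth


class FakeSQS:
    """Hypothetical stand-in that mimics the two SQS calls used above."""

    def __init__(self, depth: int):
        self.depth = depth

    def get_queue_url(self, QueueName):
        return {"QueueUrl": f"https://sqs.example.invalid/123456789012/{QueueName}"}

    def get_queue_attributes(self, QueueUrl, AttributeNames):
        # SQS returns attribute values as strings, so the fake does too.
        return {"Attributes": {"ApproximateNumberOfMessages": str(self.depth)}}


class FakeCloudWatch:
    """Hypothetical stand-in that records every put_metric_data call."""

    def __init__(self):
        self.calls = []

    def put_metric_data(self, Namespace, MetricData):
        self.calls.append((Namespace, MetricData))


if __name__ == "__main__":
    cw = FakeCloudWatch()
    depth = publish_queue_depth(FakeSQS(42), cw, "my-worker-queue")
    print(depth, cw.calls[0][0])
```

In the real Lambda you would call the same function with `boto3.client('sqs')` and `boto3.client('cloudwatch')`; the fakes exist only so the parsing and the metric payload can be checked locally.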

Step 3: Deploy and Trigger the Lambda

Now, let’s get that code into AWS and set it up to run automatically.

1. In the AWS Lambda console, create a new function from scratch.
2. Give it a name, like `publish-sqs-depth-metric`.
3. Choose a currently supported Python runtime (e.g., Python 3.12).
4. Under **Permissions**, choose “Use an existing role” and select the `SQSMetricPublisherRole` we created in Step 1.
5. Once created, paste the Python code into the `lambda_function.py` editor.
6. Go to the **Configuration** tab, then **Environment variables**. Add a variable with the key `QUEUE_NAME` and the value being the name of your SQS queue.
7. Finally, we need a trigger. Click **Add trigger**, select **EventBridge (CloudWatch Events)**, and create a new rule. Configure it as a **Schedule expression** with a rate like `rate(1 minute)`. This will execute our function every minute.
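
If you prefer scripting the trigger over clicking through the console, the sketch below shows roughly the calls involved: `put_rule` and `put_targets` on the EventBridge client, plus `add_permission` on the Lambda client so EventBridge is allowed to invoke the function. The helper `schedule_lambda`, the rule name, and the `DryRunClient` recorder are my own illustrative names; the recorder just captures calls so you can inspect them without an AWS account.

```python
def schedule_lambda(events, lambda_client, function_name, function_arn,
                    rule_name="publish-sqs-depth-every-minute",
                    schedule="rate(1 minute)"):
    """Create a scheduled rule, let it invoke the function, and attach the target."""
    rule_arn = events.put_rule(
        Name=rule_name,
        ScheduleExpression=schedule,
        State="ENABLED",
    )["RuleArn"]
    # Without this resource-based permission the rule fires but cannot invoke the Lambda.
    lambda_client.add_permission(
        FunctionName=function_name,
        StatementId=f"{rule_name}-invoke",
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule_arn,
    )
    events.put_targets(Rule=rule_name, Targets=[{"Id": "1", "Arn": function_arn}])
    return rule_arn


class DryRunClient:
    """Hypothetical recorder: stores each call instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        def call(**kwargs):
            self.calls.append((name, kwargs))
            # Only put_rule's response is read above; a placeholder ARN is enough.
            return {"RuleArn": "arn:aws:events:us-east-1:123456789012:rule/example"}
        return call


if __name__ == "__main__":
    events, lam = DryRunClient(), DryRunClient()
    schedule_lambda(
        events, lam, "publish-sqs-depth-metric",
        "arn:aws:lambda:us-east-1:123456789012:function:publish-sqs-depth-metric",
    )
    for name, kwargs in events.calls + lam.calls:
        print(name, kwargs)
```

For a real run, pass `boto3.client('events')` and `boto3.client('lambda')` instead of the recorders. The console's **Add trigger** flow handles the invoke permission for you, which is why it is easy to forget when scripting.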

Step 4: Create CloudWatch Alarms

With our metric flowing into CloudWatch, we can now create alarms that will trigger our scaling actions. We need two: one to scale up when the queue is busy, and one to scale down when it’s quiet.

1. Navigate to CloudWatch > Alarms.
2. Click **Create alarm** and select your metric. You’ll find it under the custom namespace `CustomSQSMetrics`.
3. **Create the Scale-Up Alarm:**

**Metric name:** SQSQueueDepth
**Statistic:** Average
**Period:** 1 Minute
**Threshold type:** Static
**Condition:** Whenever SQSQueueDepth is… Greater/Equal
**Threshold:** Choose a value that means “we are getting busy”. For example, `100`.
**Datapoints to alarm:** Set this to `2 out of 3` to prevent scaling on a brief, momentary spike.

4. **Create the Scale-Down Alarm:**

Follow the same steps, but with different logic.
**Condition:** Whenever SQSQueueDepth is… Less/Equal
**Threshold:** Choose a value that means “it’s safe to scale in”. For example, `10`.
**Datapoints to alarm:** Use a more conservative setting here, like `5 out of 5`, to ensure the queue is truly idle before removing capacity.

Pro Tip: Don’t configure any actions (like SNS notifications) on these alarms yet. We will link them directly to the ASG policies in the next step. Let the alarms exist in an ‘OK’ state first.
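
For reference, the two alarms above map onto CloudWatch's `put_metric_alarm` parameters roughly as follows. This is a sketch, not the only way to do it: the helper `alarm_kwargs`, the alarm names, and the queue name are placeholders I picked, and `AlarmActions` is left empty to match the Pro Tip.

```python
def alarm_kwargs(name, queue_name, comparison, threshold,
                 datapoints_to_alarm, evaluation_periods, alarm_actions=()):
    """Build keyword arguments for cloudwatch.put_metric_alarm on the custom metric."""
    return {
        "AlarmName": name,
        "Namespace": "CustomSQSMetrics",
        "MetricName": "SQSQueueDepth",
        # The dimension must match what the Lambda publishes, or the alarm sees no data.
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Average",
        "Period": 60,  # seconds, i.e. the 1 Minute period
        "EvaluationPeriods": evaluation_periods,    # the N in "M out of N"
        "DatapointsToAlarm": datapoints_to_alarm,   # the M in "M out of N"
        "Threshold": threshold,
        "ComparisonOperator": comparison,
        "AlarmActions": list(alarm_actions),
    }


# Placeholder alarm and queue names; thresholds follow the examples in this step.
SCALE_UP = alarm_kwargs("sqs-depth-high", "my-worker-queue",
                        "GreaterThanOrEqualToThreshold", 100, 2, 3)
SCALE_DOWN = alarm_kwargs("sqs-depth-low", "my-worker-queue",
                          "LessThanOrEqualToThreshold", 10, 5, 5)

if __name__ == "__main__":
    for alarm in (SCALE_UP, SCALE_DOWN):
        print(alarm["AlarmName"], alarm["ComparisonOperator"], alarm["Threshold"])
```

With boto3 installed and credentials configured, `boto3.client('cloudwatch').put_metric_alarm(**SCALE_UP)` would create the first alarm.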

Step 5: Link Alarms to Your Auto Scaling Group

This is the final connection. We’ll tell the ASG to add or remove instances when our alarms change state.

1. Go to EC2 > Auto Scaling Groups and select your target ASG.
2. Click on the **Automatic scaling** tab.
3. Click **Create dynamic scaling policy**.
4. **Create the Scale-Up Policy:**

**Policy type:** Simple scaling
**Name:** `add-one-worker`
**CloudWatch alarm:** Select the scale-up alarm you created.
**Take the action:** Add `1` instances.
Set a **Cooldown period** (e.g., 300 seconds) to prevent the ASG from adding new instances too rapidly.

5. **Create the Scale-Down Policy:**

**Policy type:** Simple scaling
**Name:** `remove-one-worker`
**CloudWatch alarm:** Select the scale-down alarm.
**Take the action:** Remove `1` instances.
Give this a longer cooldown, perhaps 600 seconds, to be more conservative about removing capacity.
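
The same two policies can be expressed as arguments to the Auto Scaling `put_scaling_policy` call. Here is a small sketch of that mapping; `simple_scaling_policy` and the ASG name `my-worker-asg` are placeholders of mine. Note that removing an instance is written as a negative adjustment.

```python
def simple_scaling_policy(asg_name, policy_name, adjustment, cooldown):
    """Build keyword arguments for autoscaling.put_scaling_policy (simple scaling)."""
    return {
        "AutoScalingGroupName": asg_name,
        "PolicyName": policy_name,
        "PolicyType": "SimpleScaling",
        # ChangeInCapacity adds or removes a fixed number of instances.
        "AdjustmentType": "ChangeInCapacity",
        "ScalingAdjustment": adjustment,
        "Cooldown": cooldown,  # seconds
    }


# Placeholder ASG name; adjustments and cooldowns follow the examples in this step.
ADD_ONE = simple_scaling_policy("my-worker-asg", "add-one-worker", 1, 300)
REMOVE_ONE = simple_scaling_policy("my-worker-asg", "remove-one-worker", -1, 600)

if __name__ == "__main__":
    for policy in (ADD_ONE, REMOVE_ONE):
        print(policy["PolicyName"], policy["ScalingAdjustment"], policy["Cooldown"])
```

Calling `boto3.client('autoscaling').put_scaling_policy(**ADD_ONE)` returns a response containing `PolicyARN`. When scripting, that ARN is what goes into the matching alarm's `AlarmActions`, which is the link the console makes for you when you pick the alarm in the policy form.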

Common Pitfalls (Where I Usually Mess Up)

IAM Permissions: My number one mistake, every time. The Lambda runs but fails silently: the handler catches the exception, so the invocation looks successful while no metric is published. The CloudWatch logs for the function will show an ‘AccessDenied’ error. Always double-check that `cloudwatch:PutMetricData`, `sqs:GetQueueUrl`, and `sqs:GetQueueAttributes` are in the policy attached to the Lambda’s execution role.
Metric Namespace/Name Mismatch: I once spent an hour debugging why my alarm never fired. I had named the metric ‘QueueLength’ in my Lambda but was looking for ‘SQSQueueDepth’ in the CloudWatch alarm setup. Be obsessively consistent with your naming!
Alarm Flapping: Setting your scale-up and scale-down thresholds too close together or making them too sensitive (e.g., 1 out of 1 datapoints) will cause ‘flapping’: the ASG will constantly scale up and down. Use multiple evaluation periods, and make the scale-down alarm require more datapoints than the scale-up alarm (e.g., ‘5 out of 5’ versus ‘2 out of 3’) to smooth this out.
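
To see why ‘M out of N’ helps with flapping, here is a deliberately simplified model of the evaluation rule. It is an assumption-laden sketch: it ignores missing data and everything else CloudWatch does internally, and `alarm_states` plus the sample depths are made up for illustration.

```python
from collections import deque


def alarm_states(values, breaches, datapoints_to_alarm, evaluation_periods):
    """Return one bool per datapoint: would an M-out-of-N alarm be in ALARM here?"""
    window = deque(maxlen=evaluation_periods)  # keeps only the last N datapoints
    states = []
    for value in values:
        window.append(breaches(value))
        states.append(sum(window) >= datapoints_to_alarm)
    return states


def is_busy(depth):
    """Scale-up condition from this guide: queue depth of 100 or more."""
    return depth >= 100


if __name__ == "__main__":
    # Made-up queue depths: one isolated spike, then a sustained burst.
    depths = [5, 150, 5, 5, 120, 130, 5, 5]
    print("1 out of 1:", alarm_states(depths, is_busy, 1, 1))
    print("2 out of 3:", alarm_states(depths, is_busy, 2, 3))
```

In this model the ‘1 out of 1’ alarm fires on the isolated spike as well as the burst, while ‘2 out of 3’ ignores the lone spike and only fires once the burst has lasted two datapoints.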

Conclusion

And that’s it. You’ve now decoupled your worker scaling from generic metrics like CPU or memory and tied it directly to the *actual workload*: the number of jobs waiting to be processed. This is more efficient, far more responsive, and significantly more cost-effective. In my production setups, this pattern is a non-negotiable lifesaver for any asynchronous job processing architecture. Happy scaling!

Darian Vance

Lead Cloud Architect & DevOps Strategist

With over 12 years in system architecture and automation, Darian specializes in simplifying complex cloud infrastructures. An advocate for open-source solutions, he founded TechResolve to provide engineers with actionable, battle-tested troubleshooting guides and robust software alternatives.

🤖 Frequently Asked Questions

❓ How do I scale my AWS Auto Scaling Group based on SQS queue depth?

To scale an AWS ASG based on SQS queue depth, you deploy a Lambda function to periodically read the SQS queue’s `ApproximateNumberOfMessages` and publish it as a custom metric to CloudWatch. Then, you create CloudWatch alarms on this custom metric to trigger ASG scaling policies for adding or removing instances.

❓ How does this custom SQS-based scaling compare to default ASG scaling metrics like CPU utilization?

This custom SQS-based scaling is superior for asynchronous job processing because it directly scales based on the actual workload (messages waiting in the queue), rather than generic resource utilization like CPU or memory. This leads to more efficient, responsive, and cost-effective scaling, as instances are provisioned precisely when demand requires them.

❓ What are common pitfalls when implementing SQS-based ASG scaling?

Common pitfalls include incorrect IAM permissions for the Lambda function (missing `sqs:GetQueueUrl`, `sqs:GetQueueAttributes`, or `cloudwatch:PutMetricData`), mismatches in metric namespace or name between the Lambda and CloudWatch alarms, and ‘alarm flapping’ caused by setting scale-up/scale-down thresholds too close or using overly sensitive evaluation periods.
