🚀 Executive Summary

TL;DR: Unmanaged data transfers through AWS NAT Gateways can lead to significant, unexpected cloud bills due to per-GB data processing charges. Solutions involve immediate triage using VPC Flow Logs, establishing AWS Budgets billing alarms, and implementing architectural fixes like Gateway VPC Endpoints for services like S3 to keep traffic internal and eliminate NAT data processing charges.

🎯 Key Takeaways

  • AWS NAT Gateways incur both an hourly charge and a per-GB data processing charge for all internet-bound traffic from private subnets, which can lead to massive unexpected costs.
  • Gateway VPC Endpoints for services like S3 enable private, internal routing of traffic within the AWS network, bypassing the NAT Gateway and eliminating data processing charges for those specific service communications.
  • VPC Flow Logs are essential for diagnosing runaway NAT Gateway costs by identifying the specific source IP and instance responsible for high data egress, allowing for quick remediation.


A simple configuration oversight can silently drain your cloud budget. Learn how to diagnose and fix runaway AWS NAT Gateway costs with practical, real-world solutions that prevent thousand-dollar surprises.

That Time a $70 NAT Gateway Cost Us $7,000 Over a Weekend

I still remember the feeling in the pit of my stomach. It was a Monday morning, and I was scrolling through our weekend alerts over coffee when I saw it: a billing alarm from AWS CloudWatch. Not a small one, either. The forecast for our ‘dev’ account’s monthly bill had shot up from a predictable $10k to a projected $45k. In 48 hours.

After the initial panic (and a second, much stronger coffee), we traced the source. It wasn’t some massive, un-tagged EC2 fleet or a forgotten RDS instance. It was a humble, often-ignored networking component: the NAT Gateway. A junior data scientist, trying to download a massive public dataset for a new ML model, had inadvertently kicked off a multi-terabyte transfer from an EC2 instance sitting in a private subnet. Every single byte went out through our NAT Gateway, and AWS charged us for all of it. This little story isn’t unique; I saw a similar tale of woe on a Reddit FinOps thread just last week, which is what prompted me to write this down for my team, and now for you.

The “Why”: The Silent Killer in Your VPC

Before we dive into the fixes, you need to understand why this happens. It’s not a bug; it’s the system working as designed. When a resource with no public IP (like an EC2 instance in a private subnet) needs to talk to the internet, it sends its traffic through a NAT Gateway. You pay for two things:

  • An hourly charge for the gateway just existing.
  • A per-GB data processing charge for everything that flows through it.

That per-GB charge is the killer. A developer running apt-get update is noise. A data scientist pulling the entire ‘Common Crawl’ dataset is a financial disaster waiting to happen. The root cause isn’t just the data transfer; it’s a breakdown in architectural awareness. We put things in private subnets for security, but we forget that security has a cost if not managed correctly.

Pro Tip from the Trenches: If you take nothing else from this, go set up a billing alarm in AWS Budgets right now. Set a threshold that’s slightly above your normal daily spend. It’s your single best defense against a surprise five-figure bill.
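Since the rest of this post uses Terraform, here's a minimal sketch of that billing alarm as an `aws_budgets_budget` resource. The $12,000 limit and the email address are placeholder assumptions; set them to match your own account's baseline.

```hcl
# Illustrative sketch: a monthly cost budget that emails when actual spend
# crosses 80% of the limit. Limit amount and email are placeholders.
resource "aws_budgets_budget" "monthly_cost_guardrail" {
  name         = "monthly-cost-guardrail"
  budget_type  = "COST"
  limit_amount = "12000"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["cloud-costs@example.com"]
  }

  # A second notification on FORECASTED spend is what catches a runaway
  # weekend transfer before the money is actually gone.
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_email_addresses = ["cloud-costs@example.com"]
  }
}
```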

The Fixes: From Triage to Architecture

Okay, so your bill is skyrocketing. What do you do? Here are the plays, from stopping the immediate bleeding to ensuring it never happens again.

1. The Quick Fix: Find It and Kill It

Your first priority is to stop the cash drain. You don’t have time for a full architectural review. You need to act.

The best tool for this is VPC Flow Logs. If you don’t have them on, turn them on now (they have a cost, but it’s tiny compared to this problem). Query your flow logs for traffic passing through your NAT Gateway’s Elastic Network Interface (ENI). You’re looking for the source IP that is sending an enormous amount of traffic.
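If you're turning flow logs on from scratch, here's a minimal Terraform sketch that ships them to CloudWatch Logs, with a Logs Insights query for finding top talkers. The resource names (`aws_vpc.main`, `aws_iam_role.flow_logs`), the log group name, and the ENI ID in the query are assumptions; substitute your own, and note the IAM role needs `logs:CreateLogStream` and `logs:PutLogEvents` permissions.

```hcl
# Illustrative sketch: ship VPC Flow Logs to CloudWatch Logs so you can
# query for top talkers. Names and the ENI ID below are placeholders.
resource "aws_cloudwatch_log_group" "vpc_flow_logs" {
  name              = "/vpc/flow-logs/main"
  retention_in_days = 14
}

resource "aws_flow_log" "main" {
  vpc_id          = aws_vpc.main.id
  traffic_type    = "ALL"
  log_destination = aws_cloudwatch_log_group.vpc_flow_logs.arn
  iam_role_arn    = aws_iam_role.flow_logs.arn
}

# With logs flowing, a CloudWatch Logs Insights query like this surfaces
# the private IPs pushing the most bytes through the NAT Gateway's ENI:
#
#   filter interfaceId = "eni-0123456789abcdef0"
#   | stats sum(bytes) as totalBytes by srcAddr
#   | sort totalBytes desc
#   | limit 10
```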

Let’s say you find that 10.0.10.123 (the private IP of an instance named dev-data-processor-01) is the culprit. What next?

  1. SSH into the box: Find the offending process. Is it a docker pull? A wget command? A Python script? Kill it with pkill (or kill -9 as a last resort). This is the simplest solution.
  2. Stop the instance: If you can’t isolate the process, just stop the instance from the AWS Console. You can deal with the “why” later. The bleeding is stopped.
  3. The “Hacky” Egress Move: I hesitate to even write this, but in a true emergency… if the instance must finish its download, you can temporarily move it to a public subnet and attach an Elastic IP. This routes its traffic through an Internet Gateway (IGW) instead, which has no data processing charge. THIS IS A SECURITY RISK. You are exposing the instance to the internet. Only do this if you understand the risks and can lock down the Security Group immediately.

2. The Permanent Fix: Smart Architecture with VPC Endpoints

The real, long-term solution is to avoid sending traffic to the public internet when you don’t have to. The most common culprit for these massive data transfers? Talking to AWS services, especially S3.

Your dev-data-processor-01 was probably pulling a dataset from a public S3 bucket. That traffic leaves your VPC, hits the S3 public endpoint on the internet, and comes back… all through your costly NAT Gateway.

The fix is to use a Gateway VPC Endpoint for S3. This is a magical little route table entry that tells your VPC, “Hey, for any traffic destined for S3 in this region, don’t go out to the internet. Send it over this private, internal path instead.”

The result? Traffic between your EC2 instance and S3 stays within the AWS network. It never touches the NAT Gateway. The data processing charge is $0.

Here’s how you’d define one in Terraform. It’s ridiculously simple for the money it saves (just make sure the region in service_name matches your VPC’s region):

```hcl
resource "aws_vpc_endpoint" "private_s3" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.us-east-1.s3" # match your VPC's region
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private.id]

  tags = {
    Name = "s3-gateway-endpoint"
  }
}
```

For other AWS services, you can use Interface Endpoints. They carry a small hourly charge plus a per-GB fee, but that fee is well below the NAT Gateway’s data processing rate, so they usually come out cheaper under heavy load. Architect your VPCs to keep traffic internal wherever possible.
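As a concrete example, here’s a minimal sketch of an Interface Endpoint for the ECR API, a common source of NAT traffic from container image pulls. The subnet and security group references are assumptions from this example, not a prescription.

```hcl
# Illustrative sketch: an Interface Endpoint for ECR's API. Subnet and
# security group names are placeholders for your own resources.
resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.us-east-1.ecr.api"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = [aws_subnet.private.id]
  security_group_ids  = [aws_security_group.endpoints.id]
  private_dns_enabled = true
}
```

Note that fully private docker pulls from ECR also need the `ecr.dkr` interface endpoint and an S3 gateway endpoint, since image layers are served from S3.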

3. The ‘Nuclear’ Option: Enforce with Network ACLs

Sometimes, you need a bigger hammer. Maybe you’re in a highly regulated environment or you’re running a lab for juniors where you can’t risk this happening again. This is where you use Network Access Control Lists (NACLs).

Think of Security Groups as a firewall for your instance, and NACLs as a firewall for your entire subnet. They are stateless and unforgiving. You can set up an outbound rule on your private subnets’ NACL that says:

  • DENY all outbound traffic (0.0.0.0/0).
  • ALLOW traffic to specific IPs or CIDR ranges you need (e.g., your corporate network for SSH, or the IP for a critical partner API).
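Those two rules look something like this in Terraform. NACLs evaluate rules in ascending rule_number order, so the ALLOW must come before the catch-all DENY; the corporate CIDR (203.0.113.0/24) and the `aws_network_acl.private` reference are placeholder assumptions.

```hcl
# Illustrative sketch: allow HTTPS egress to a corporate CIDR, then deny
# everything else. CIDR and NACL reference are placeholders.
resource "aws_network_acl_rule" "allow_corp_egress" {
  network_acl_id = aws_network_acl.private.id
  rule_number    = 100
  egress         = true
  protocol       = "tcp"
  rule_action    = "allow"
  cidr_block     = "203.0.113.0/24"
  from_port      = 443
  to_port        = 443
}

resource "aws_network_acl_rule" "deny_all_egress" {
  network_acl_id = aws_network_acl.private.id
  rule_number    = 200
  egress         = true
  protocol       = "-1"
  rule_action    = "deny"
  cidr_block     = "0.0.0.0/0"
}
```

Because NACLs are stateless, remember that responses to inbound connections also count as egress: you may need additional allow rules for ephemeral ports, or you will break traffic you didn’t intend to touch.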

This effectively cuts off general internet access from your private subnets. An instance can’t accidentally pull a 2TB file because the NACL will drop the packets at the subnet boundary. It’s a fantastic cost-control mechanism, but it’s also a great way to break applications that rely on unforeseen internet access for things like package updates or metadata services.

Warning: The NACL option is a sledgehammer. Deploy it with care. You will break something if you don’t audit your application’s outbound dependencies first. Use this to enforce a strict “no internet” policy, not as a casual cost-saving measure.

Comparison of Solutions

| Solution | Pros | Cons | When to Use |
| --- | --- | --- | --- |
| 1. Quick Fix (Kill Process) | Fastest way to stop billing; no infra changes. | Doesn’t prevent recurrence; requires manual intervention. | In the middle of a cost emergency. Your first move. |
| 2. Permanent Fix (VPC Endpoints) | Architecturally sound; drastically reduces cost; improves security. | Requires planning and infrastructure-as-code changes. | The standard, best-practice solution for all production VPCs. |
| 3. Nuclear Option (NACLs) | Guarantees no unexpected egress; ultimate cost control. | Brittle; high risk of breaking applications; hard to manage. | For high-security or sandbox environments where you must enforce a “no internet” policy. |

In the end, we implemented VPC endpoints for S3 and DynamoDB across all our major VPCs. The “Great Weekend Billing Scare of Q3” became a legend and a powerful teaching moment. Don’t wait for your own war story. Check your architecture, set up your billing alarms, and keep an eye on that humble little NAT Gateway. It’s quiet, but it can have a very expensive bite.

Darian Vance

Lead Cloud Architect & DevOps Strategist

With over 12 years in system architecture and automation, Darian specializes in simplifying complex cloud infrastructures. An advocate for open-source solutions, he founded TechResolve to provide engineers with actionable, battle-tested troubleshooting guides and robust software alternatives.


🤖 Frequently Asked Questions

❓ How can I prevent unexpected high costs from AWS NAT Gateways?

Prevent high NAT Gateway costs by setting up AWS Budgets billing alarms, using VPC Flow Logs to identify and stop excessive data transfers, and architecturally implementing Gateway VPC Endpoints for AWS services like S3 to keep traffic internal.

❓ How do VPC Endpoints compare to NAT Gateways for accessing AWS services?

NAT Gateways route all internet-bound traffic from private subnets, incurring per-GB data processing charges. VPC Endpoints provide a private, internal path to specific AWS services (e.g., S3), bypassing the NAT Gateway and eliminating data processing costs for that service traffic.

❓ What is a common pitfall when using Network ACLs for cost control, and how can it be avoided?

A common pitfall is using Network ACLs to deny all outbound traffic without first auditing application dependencies, which can break critical functionality. Avoid this by thoroughly mapping all necessary outbound connections or by prioritizing less disruptive architectural solutions like VPC Endpoints.
