🚀 Executive Summary
TL;DR: AWS Network Load Balancers (NLB) now only publish DNS records for Availability Zones (AZs) with healthy targets, silently breaking Disaster Recovery (DR) plans that keep the standby AZ empty until failover. To mitigate, keep a ‘sacrificial lamb’ instance registered in the standby AZ, run a scaled-down ‘warm standby’ environment there, or use Route 53 DNS Failover so the endpoint keeps resolving during failover events.
🎯 Key Takeaways
- AWS NLB DNS resolution silently changed to only publish IPs for AZs containing at least one healthy registered target, unlike the old behavior which published for all enabled AZs.
- This change breaks common cold or warm standby DR patterns where the standby AZ is enabled on the NLB but has no registered targets: when the primary AZ fails, no AZ has a healthy target, and the NLB’s IPs are pulled from DNS.
- Effective solutions include maintaining a ‘sacrificial lamb’ instance, running a ‘warm standby’ scaled-down application, or using Route 53 DNS Failover for robust DR.
A recent, unannounced change in how AWS Network Load Balancers (NLB) handle DNS for empty Availability Zones is breaking common Disaster Recovery plans. Here’s a breakdown of the problem and three practical ways to fix it.
That ‘Silent’ NLB Change That Wrecked Your Failover Plan
It was 3 AM on a Tuesday, of course, when PagerDuty started screaming. Our primary services in us-east-1a were down hard. A provider issue. “No problem,” I mumbled to my confused dog, “this is what we drill for.” We have a robust Disaster Recovery plan with everything ready to spin up in us-east-1b. I triggered the failover, our automation kicked in to bring up the DR instances, and… nothing. Silence. Curls against our main endpoint, api.techresolve.com, were failing DNS resolution. Our NLB, the highly-available, unbreakable front door to our entire infrastructure, had effectively vanished from the internet. That was the night I learned a painful lesson about a subtle, but critical, change in how NLBs work.
The “Why”: What Actually Changed?
This isn’t a bug; it’s a feature with terrible side effects for a common DR pattern. The behavior change boils down to this:
- The Old Way: An NLB’s DNS name used to resolve to the IP addresses of the NLB nodes in all Availability Zones you enabled for it, regardless of whether there were healthy targets in those AZs.
- The New Way: An NLB’s DNS name now only resolves to the IPs of NLB nodes in AZs that have at least one healthy registered target.
You see the problem? Many of us, myself included, used a “cold” or “warm” standby model. Our primary AZ (e.g., us-east-1a) would have a dozen healthy targets. Our DR AZ (us-east-1b) would have the NLB enabled, but with zero registered targets, waiting for us to spin them up during a disaster. When our primary AZ went down, all its targets became unhealthy. Since our DR AZ had no targets yet, the NLB suddenly had no AZs with healthy targets. As a result, AWS de-registered all IPs from the NLB’s public DNS record. Poof. Gone.
Pro Tip: Don’t let cross-zone load balancing confuse you here. Cross-zone load balancing is about distributing traffic from a healthy NLB node in one AZ to targets in another AZ. It has absolutely no bearing on this DNS publishing behavior. If an AZ has zero healthy targets, its NLB node IP will be pulled from public DNS, period.
The Fixes: From Duct Tape to Bedrock
Alright, enough doom and gloom. How do we fix this? I’ve seen a few patterns emerge, from the quick-and-dirty to the architecturally sound.
Solution 1: The ‘Sacrificial Lamb’ (The Quick Fix)
This is the fastest, cheapest way to get back to a working state. The idea is to keep a single, tiny, cheap instance in your DR target group to act as a placeholder.
- Launch a t4g.nano or t3.nano instance in your DR Availability Zone (e.g., us-east-1b).
- Install a basic web server on it that will always respond to the health check (e.g., Nginx serving a static “OK” page).
- Register this single instance with your DR target group.
Now, your DR AZ always has one “healthy” target. The NLB will keep the IP for the us-east-1b node published in DNS. When a real failover happens, your automation just adds the new, powerful instances (like your prod-db-01-dr) to the same target group. It’s a bit of a hack, but it’s effective and costs pennies per day.
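If you manage your infrastructure with Terraform, the placeholder might look something like the sketch below. Treat it as an illustration rather than our exact config: the variable names, the `dr_placeholder` resource name, the Amazon Linux 2023 arm64 AMI, and port 80 are all assumptions you would swap for your own values.

```hcl
# Hypothetical inputs: replace with values from your own environment.
variable "dr_subnet_id" {
  type        = string
  description = "Subnet in the DR AZ (e.g., us-east-1b)"
}

variable "placeholder_ami_id" {
  type        = string
  description = "arm64 AMI for the placeholder (assumed: Amazon Linux 2023)"
}

variable "placeholder_security_group_id" {
  type        = string
  description = "Security group that allows the NLB health check port"
}

variable "dr_target_group_arn" {
  type        = string
  description = "ARN of the target group behind the NLB"
}

# The 'sacrificial lamb': one tiny instance that always passes the health check.
resource "aws_instance" "dr_placeholder" {
  ami                    = var.placeholder_ami_id
  instance_type          = "t4g.nano" # ARM-based; use t3.nano with an x86_64 AMI
  subnet_id              = var.dr_subnet_id
  vpc_security_group_ids = [var.placeholder_security_group_id]

  # Assumes a dnf-based image; installs Nginx and serves its default page.
  user_data = <<-EOT
    #!/bin/bash
    dnf install -y nginx
    systemctl enable --now nginx
  EOT

  tags = {
    Name = "dr-placeholder"
  }
}

# Register the placeholder so the DR AZ always has one healthy target.
resource "aws_lb_target_group_attachment" "dr_placeholder" {
  target_group_arn = var.dr_target_group_arn
  target_id        = aws_instance.dr_placeholder.id
  port             = 80 # assumed health check / traffic port
}
```

One thing to watch: if your target group health check hits a specific path or port, the placeholder has to answer on exactly that path and port, or it will be marked unhealthy and you are back where you started.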
Solution 2: The ‘Warm Standby’ (The Proper Fix)
This is the approach we ultimately adopted at TechResolve. It costs a bit more, but it’s cleaner and reduces your Recovery Time Objective (RTO).
Instead of a placeholder, you run a scaled-down version of your actual application in the DR AZ at all times. For example, if your primary AZ runs 10 servers, your DR AZ might run 1 or 2 active servers handling a tiny fraction of traffic (or just sitting idle but healthy).
Your failover process then becomes about scaling up the existing DR environment, not creating it from scratch. This ensures you always have healthy targets in every AZ, so the NLB DNS is always fully populated. This is the way AWS architects would likely recommend, as it keeps all AZs “warm” and ready to take significant load.
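As a rough sketch of what that can look like in Terraform, here is an Auto Scaling group pinned to the DR subnet with a floor of one instance. The variable names, the group name, and the sizes are assumptions for illustration; it presumes you already have a launch template for the application.

```hcl
# Hypothetical inputs: replace with values from your own environment.
variable "dr_subnet_id" {
  type        = string
  description = "Subnet in the DR AZ (e.g., us-east-1b)"
}

variable "app_launch_template_id" {
  type        = string
  description = "Launch template for the real application servers"
}

variable "target_group_arn" {
  type        = string
  description = "ARN of the target group behind the NLB"
}

resource "aws_autoscaling_group" "dr_warm_standby" {
  name                = "app-dr-warm-standby"
  vpc_zone_identifier = [var.dr_subnet_id]
  target_group_arns   = [var.target_group_arn]

  # Never drop below one instance, so the DR AZ always has a healthy target.
  min_size         = 1
  desired_capacity = 1
  max_size         = 10 # headroom for failover; matches the 10-server example

  # Replace instances that fail the load balancer health check.
  health_check_type         = "ELB"
  health_check_grace_period = 120

  launch_template {
    id      = var.app_launch_template_id
    version = "$Latest"
  }

  # Let failover automation raise desired_capacity without Terraform undoing it.
  lifecycle {
    ignore_changes = [desired_capacity]
  }
}
```

With this shape, failover is a single change to `desired_capacity` on a group that already exists, which is where the RTO improvement comes from.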
Solution 3: The ‘Route 53 Detour’ (The Nuclear Option)
If you want the most resilient (and most complex) setup, you can abstract the problem away from the NLB entirely using Route 53 DNS Failover.
The architecture looks like this:
- Create a Route 53 health check that targets an endpoint on your application through the NLB. This is critical. Don’t just ping an instance; make sure the entire path is being tested.
- Create a Route 53 DNS record (e.g., api.techresolve.com) as a Failover record set.
- The Primary record is an Alias pointing to your NLB, associated with the health check you created.
- The Secondary record can point to anything you want: another NLB in a different region, a static S3 website with a “maintenance” page, or a different application stack entirely.
Now, if the NLB becomes unresolvable because all its targets are unhealthy, the Route 53 health check will fail. After a few failed checks, Route 53 will automatically flip the public DNS to point to your secondary target. This is the ultimate safety net, as it doesn’t depend on NLB DNS behavior at all, but it adds another layer of complexity to manage.
```hcl
# Example Terraform snippet for a Route 53 Health Check
resource "aws_route53_health_check" "api_endpoint_check" {
  # Probe the application through the NLB so the whole path is tested
  fqdn              = "nlb-prod-1234567890abcdef.elb.us-east-1.amazonaws.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3  # consecutive failures before the check is unhealthy
  request_interval  = 30 # seconds between checks
  tags = {
    Name = "health-check-for-prod-nlb"
  }
}
```
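The health check only does something once it is attached to a failover record pair. Below is a minimal sketch of that pair, assuming the secondary is another NLB in a different region. The variables are hypothetical placeholders, and `aws_route53_health_check.api_endpoint_check` refers to the health check defined in the snippet above.

```hcl
# Hypothetical inputs: replace with values from your own environment.
variable "hosted_zone_id" {
  type        = string
  description = "Route 53 hosted zone for techresolve.com"
}

variable "primary_nlb_dns_name" { type = string }
variable "primary_nlb_zone_id" { type = string } # the NLB's canonical hosted zone ID

variable "secondary_nlb_dns_name" { type = string }
variable "secondary_nlb_zone_id" { type = string }

# Primary: served while the health check through the NLB is passing.
resource "aws_route53_record" "api_primary" {
  zone_id         = var.hosted_zone_id
  name            = "api.techresolve.com"
  type            = "A"
  set_identifier  = "primary-nlb"
  health_check_id = aws_route53_health_check.api_endpoint_check.id

  failover_routing_policy {
    type = "PRIMARY"
  }

  alias {
    name                   = var.primary_nlb_dns_name
    zone_id                = var.primary_nlb_zone_id
    evaluate_target_health = true
  }
}

# Secondary: served when the primary's health check fails.
resource "aws_route53_record" "api_secondary" {
  zone_id        = var.hosted_zone_id
  name           = "api.techresolve.com"
  type           = "A"
  set_identifier = "secondary-nlb"

  failover_routing_policy {
    type = "SECONDARY"
  }

  alias {
    name                   = var.secondary_nlb_dns_name
    zone_id                = var.secondary_nlb_zone_id
    evaluate_target_health = false
  }
}
```

Both records share the same name and type and differ only in `set_identifier` and failover role. If your secondary is an S3 maintenance page or a different stack, only the `alias` block of the second record changes.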
Conclusion: Test Your Assumptions
This whole incident was a stark reminder: the cloud is not magic. It’s a platform built by humans, with features that change and evolve. What worked last year might silently break your recovery plan today. Don’t just test your failover scripts; actively test the underlying assumptions you’ve made about your cloud provider’s services. Spin up a test environment, kill all your targets in one AZ, and run a dig command. You might be surprised by what you find.
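If you would rather have that dig check run every time you plan or apply, one option is a Terraform `check` block. This is a sketch under a few assumptions: Terraform 1.5 or later, the `hashicorp/dns` provider, and two hypothetical variables for the NLB hostname and the number of AZs you enabled. It is a drift alarm at plan time, not a replacement for real monitoring.

```hcl
terraform {
  required_providers {
    dns = {
      source = "hashicorp/dns"
    }
  }
}

# Hypothetical inputs: replace with values from your own environment.
variable "nlb_dns_name" {
  type        = string
  description = "DNS name of the NLB to verify"
}

variable "enabled_az_count" {
  type        = number
  description = "Number of AZs enabled on the NLB"
}

# Warns when the NLB publishes fewer A records than it has enabled AZs,
# which is the symptom of an AZ with no healthy targets.
check "nlb_dns_fully_populated" {
  data "dns_a_record_set" "nlb" {
    host = var.nlb_dns_name
  }

  assert {
    condition     = length(data.dns_a_record_set.nlb.addrs) >= var.enabled_az_count
    error_message = "NLB DNS is publishing fewer IPs than enabled AZs; check for AZs without healthy targets."
  }
}
```

A failed check shows up as a warning rather than blocking the run, so it is safe to leave in place while you work on the actual fix.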
| Solution | Cost | Complexity | Failover Speed |
|---|---|---|---|
| 1. Sacrificial Lamb | Very Low | Low | Medium |
| 2. Warm Standby | Medium | Medium | Fast |
| 3. Route 53 Detour | Low (Route 53 cost) | High | Fastest (DNS-based) |
🤖 Frequently Asked Questions
❓ What is the recent silent change in AWS NLB DNS publishing behavior?
Previously, NLB DNS resolved to IPs in all enabled AZs; now, it only resolves to IPs in AZs that have at least one healthy registered target.
❓ How do the ‘Sacrificial Lamb,’ ‘Warm Standby,’ and ‘Route 53 Detour’ solutions compare for NLB DR?
The ‘Sacrificial Lamb’ is very low cost, low complexity, and medium speed. ‘Warm Standby’ is medium cost, medium complexity, and fast speed. ‘Route 53 Detour’ is low cost (Route 53), high complexity, and offers the fastest (DNS-based) failover.
❓ What is a common pitfall when designing Disaster Recovery with AWS NLBs, and how can it be avoided?
A common pitfall is assuming NLB DNS will remain resolvable for AZs without healthy targets. This can be avoided by ensuring every enabled AZ always has at least one healthy target (e.g., a ‘sacrificial lamb’ instance or a ‘warm standby’ application) or by implementing Route 53 DNS Failover.