🚀 Executive Summary
TL;DR: EKS node groups managed by Rancher frequently encounter ‘state drift,’ where Rancher’s cached view of the cluster diverges from the actual AWS state, causing updates to get stuck. The most effective and permanent solution involves adopting Infrastructure as Code (IaC) tools like Terraform to manage EKS infrastructure, establishing AWS as the single source of truth.
🎯 Key Takeaways
- State drift is the core problem, occurring when Rancher’s internal state cache for an imported EKS cluster becomes inconsistent with the actual configuration in the AWS API, leading to stuck updates.
- A temporary ‘quick fix’ for a stuck Rancher-managed EKS cluster involves navigating to its ‘Edit Config’ in the Rancher UI and simply clicking ‘Save’ without making changes, which often triggers a reconciliation loop.
- The permanent and recommended solution is to manage EKS cluster infrastructure, including node groups, exclusively through Infrastructure as Code (IaC) tools like Terraform, making AWS the authoritative source of truth and reducing Rancher to a workload management tool.
Struggling with EKS node groups getting stuck or failing to update in Rancher? Understand the root cause of this painful state drift and learn three practical fixes, from a quick manual refresh to a permanent Infrastructure as Code solution.
EKS, Rancher, and the Phantom Node Group: My Guide to Winning the State Drift War
I still remember the 2 AM page. A critical deployment for our ‘Orion’ project was failing with scheduling errors, yet Grafana showed our EKS cluster had plenty of spare capacity. I logged into Rancher, and sure enough, the `orion-prod-worker-ng` node group showed 3 nodes, all maxed out. But a quick check in the AWS console showed 6 nodes, humming along happily. The autoscaler had done its job perfectly, but Rancher was living in the past, completely unaware of the new nodes. That’s when you get that sinking feeling—you’re fighting a state drift ghost, and the clock is ticking.
So, What’s Actually Happening Here? The Battle of Two Truths
Before we dive into fixes, you have to understand the core problem. When you manage an imported EKS cluster with Rancher, you have two sources of truth competing for control:
- AWS API: The absolute, undeniable source of truth for your EKS cluster’s infrastructure (control plane, node groups, scaling policies, etc.).
- Rancher’s State Cache: Rancher polls the AWS API to build its own understanding of your cluster. It uses this cached state to display in the UI and manage the cluster.
The problem arises when a change is made directly in AWS (via the console, CLI, or an IaC tool like Terraform) and Rancher’s polling/reconciliation process fails, gets stuck, or is just slow. Rancher’s cache becomes stale. It thinks a node group has a `desired_size` of 3, while AWS knows it’s 6. This “state drift” is the root of 99% of these headaches, especially when Rancher’s controller gets stuck in a `waiting for infrastructure to be updated` loop because its version of reality doesn’t match AWS’s.
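Before you touch anything, it’s worth confirming you’re actually looking at drift and not a real capacity problem. Here’s a quick sketch using the AWS CLI and kubectl; the cluster name is a placeholder (only the node group name comes from the story above):

```bash
# What AWS believes the node group looks like -- the actual source of truth.
aws eks describe-nodegroup \
  --cluster-name orion-prod-cluster \
  --nodegroup-name orion-prod-worker-ng \
  --query 'nodegroup.{status: status, scaling: scalingConfig}'

# What is actually running: count the registered Kubernetes nodes.
kubectl get nodes -l eks.amazonaws.com/nodegroup=orion-prod-worker-ng

# If desired_size and the node count agree but the Rancher UI shows a
# different number, you are looking at state drift, not a capacity issue.
```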
The Fixes: From Duct Tape to a New Foundation
I’ve seen teams stumble through this for days. Here are the three approaches I use, ranging from a quick fix to get you through the night to the permanent solution you should be aiming for.
Solution 1: The “Kick the Tires” Quick Fix
This is your emergency-level, “get the site back up” maneuver. The goal is to force Rancher to discard its stale cache and re-read the state directly from the AWS API. It’s a manual process and doesn’t prevent the problem from happening again, but it can un-stick a stuck cluster.
- Navigate to your EKS cluster within the Rancher UI.
- Click the triple-dot menu and select “Edit Config”.
- Don’t change anything. Seriously. Just scroll to the bottom and click “Save”.
This action often triggers Rancher’s reconciliation loop for the cluster. It will re-scan the node groups and associated AWS resources. In many cases, this is enough to make it see the correct node count, update its state, and move from an “Updating” to an “Active” state. It’s hacky, but it works when you’re in a pinch.
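If you have kubectl access to the Rancher management (“local”) cluster, you can watch that reconciliation happen instead of anxiously refreshing the UI. A rough sketch; the `c-abc123` object name is a placeholder, since these internal IDs vary per install:

```bash
# Run against the Rancher management ("local") cluster, not the EKS cluster.
# List Rancher's cluster objects to find your cluster's internal ID.
kubectl get clusters.management.cattle.io

# Watch the object until its state settles back to Active after the no-op Save.
kubectl get clusters.management.cattle.io c-abc123 --watch
```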
Solution 2: The Permanent, Grown-Up Fix (Use IaC)
Stop managing your cluster’s infrastructure through a UI. Just stop. Your single source of truth for what your EKS cluster and its node groups should look like must be code stored in Git. Use a tool like Terraform or Pulumi.
When you adopt Infrastructure as Code, you fundamentally change Rancher’s role. It’s no longer the “cluster configuration tool”; it’s the “cluster workload management tool”. You use Terraform to define the node group’s min/max/desired size, instance types, and tags. Rancher simply imports the finished product.
Here’s a simplified Terraform example of what this looks like:
resource "aws_eks_node_group" "prod_web_nodes" {
cluster_name = "techresolve-prod-cluster"
node_group_name = "prod-web-nodegroup-01"
node_role_arn = aws_iam_role.eks_nodes.arn
subnet_ids = module.vpc.private_subnets
instance_types = ["t3.large"]
scaling_config {
desired_size = 3
max_size = 6
min_size = 2
}
update_config {
max_unavailable = 1
}
# Ensure that the EKS Cluster is created before the Node Group
depends_on = [
aws_iam_role_policy_attachment.eks_worker_node_policy,
aws_iam_role_policy_attachment.eks_cni_policy,
aws_iam_role_policy_attachment.eks_container_registry_read_only,
]
}
With this approach, if you need to scale, you change `desired_size` in your `.tf` file, commit it, and run your CI/CD pipeline. AWS and Rancher will both reflect the change correctly because you’re manipulating the true source (AWS), and Rancher is just along for the ride.
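In practice, the day-to-day workflow looks like any other code change. A minimal sketch, assuming a standard Terraform setup with plan/apply running in CI:

```bash
# Scale up by editing the .tf file, not the console or the Rancher UI.
git checkout -b scale-prod-web-nodes
# (edit desired_size = 3 -> 5 in the aws_eks_node_group resource)
git commit -am "Scale prod-web-nodegroup-01 to 5 nodes"
git push origin scale-prod-web-nodes

# After review and merge, CI runs:
terraform plan    # should show exactly one change: desired_size 3 -> 5
terraform apply   # AWS updates the node group; Rancher catches up on its next poll
```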
Pro Tip: Once you move to IaC, enforce a strict “hands-off” policy for the AWS console and Rancher UI for infrastructure changes. All changes go through a pull request. This is the way.
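If you want to put teeth behind that policy, one option is a deny rule that blocks node group mutations from anyone who isn’t your pipeline. This is an illustrative sketch only; the account ID and CI role ARN are placeholders, and whether you apply it as an SCP or an IAM policy depends on your org setup:

```bash
# Deny manual node group changes from any principal except the CI role.
cat > deny-manual-nodegroup-changes.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Action": [
        "eks:CreateNodegroup",
        "eks:UpdateNodegroupConfig",
        "eks:UpdateNodegroupVersion",
        "eks:DeleteNodegroup"
      ],
      "Resource": "*",
      "Condition": {
        "StringNotLike": {
          "aws:PrincipalArn": "arn:aws:iam::123456789012:role/ci-terraform-role"
        }
      }
    }
  ]
}
EOF
```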
Solution 3: The “Nuke and Pave” Last Resort
Sometimes, a cluster’s state in Rancher is so thoroughly corrupted that no amount of kicking will fix it. The controllers are in a deadlock, and you’re getting nowhere. This is the time for the ‘nuclear’ option: detach and re-import.
IMPORTANT: This does not delete your EKS cluster in AWS. It only removes Rancher’s management hooks and its internal record of the cluster.
- Backup: Ensure you have backups or GitOps manifests for any applications deployed through Rancher on that cluster.
- Detach: In the Rancher UI, go to the Cluster Management view, find your broken cluster, click the triple-dot menu, and select “Delete”. It will ask for confirmation. This process removes the `cattle-cluster-agent` from your EKS cluster.
- Verify: Run `kubectl get pods -n cattle-system`. Once the agent pods are gone, the cluster is detached.
- Re-import: Go back to Rancher and click “Import Existing”. Select “Amazon EKS” and follow the steps to re-import your cluster. This will generate a new `kubectl` command for you to run, which redeploys a fresh agent.
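In command form, the detach-and-verify portion looks roughly like this. Note that the import manifest URL in the last step is generated by Rancher for your specific cluster, so the one below is a placeholder:

```bash
# After clicking Delete in Rancher, watch the agent drain out of the cluster.
kubectl get pods -n cattle-system --watch
# The cluster is detached once no cattle-cluster-agent pods remain.

# Re-import: run the exact command Rancher generates in the UI; it looks like:
kubectl apply -f https://<your-rancher-host>/v3/import/<generated-token>.yaml
```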
Rancher will now rediscover the cluster and its node groups from a clean slate, reading the pristine state directly from the AWS API. This is a disruptive option, but it’s a guaranteed way to fix a hopelessly broken state.
Choosing Your Path
Here’s a quick cheat sheet to help you decide.
| Solution | When to Use It | Pros | Cons |
|---|---|---|---|
| 1. The Quick Fix | Emergency situations; P1 incidents where you need an immediate fix. | Fast, easy, no tools required. | Temporary band-aid. Doesn’t prevent recurrence. |
| 2. The IaC Fix | The default for all production environments. The goal you should strive for. | Permanent, repeatable, auditable. The “right” way. | Requires setup, learning curve (Terraform), and team discipline. |
| 3. The Nuclear Option | When all else fails and Rancher’s state is completely broken. | Guaranteed to resolve state drift by starting fresh. | Disruptive, requires re-configuration. A last resort. |
Fighting with UI-based management for core infrastructure is a battle you will eventually lose. Take the time to set up a proper IaC workflow. Your sleep schedule will thank you.
🤖 Frequently Asked Questions
❓ What causes EKS node groups to get stuck or fail to update in Rancher?
The primary cause is ‘state drift,’ where Rancher’s cached understanding of the EKS cluster’s infrastructure, such as node group sizes or states, diverges from the actual state in the AWS API due to failed or slow reconciliation processes.
❓ How does managing EKS infrastructure with Infrastructure as Code (IaC) compare to using the Rancher UI?
IaC (e.g., Terraform) establishes AWS as the single, auditable source of truth for EKS infrastructure, preventing state drift and making Rancher a workload management tool. UI-based management is prone to inconsistencies, manual errors, and state drift because Rancher’s cache can become stale.
❓ What is a common implementation pitfall when using Rancher with EKS, and how can it be solved?
A common pitfall is making direct infrastructure changes in the AWS console, CLI, or even the Rancher UI for core EKS components, which can cause state drift. This is solved by enforcing a strict ‘hands-off’ policy for manual changes and routing all infrastructure updates through IaC and CI/CD pipelines.