🚀 Executive Summary
TL;DR: The kube-controller-manager (KCM) never calls or hands work to an external Cloud Controller Manager (CCM), so a misconfigured cluster can leave nodes stuck in an ‘uninitialized’ state. The fix is to explicitly configure both the KCM and the kubelets with the `--cloud-provider=external` flag so the KCM stops trying to run cloud provider logic.
🎯 Key Takeaways
- The kube-controller-manager (KCM) operates independently and does not invoke external Cloud Controller Managers (CCMs), causing a deadlock if not properly configured.
- To correctly integrate an external CCM, both the kube-controller-manager and kubelet on all nodes must be configured with the `--cloud-provider=external` flag.
- Configuration for `--cloud-provider=external` varies by cluster provisioner (e.g., kubeadm, Kubespray, Kops), requiring updates to specific configuration files or APIs to ensure persistence across upgrades.
Confused why your Kubernetes nodes are stuck after installing an external cloud controller manager (CCM)? This post breaks down the common misconception about the kube-controller-manager and gives you three battle-tested ways to fix the dreaded ‘uninitialized’ node state.
My Nodes Won’t Register! A Senior Engineer’s Guide to Kube-Controller-Manager vs. External CCMs
I still remember the PagerDuty alert. It was 2 AM on a Tuesday, of course. The brand new ‘Phoenix’ cluster, our flagship project on a custom bare-metal cloud, was completely dead in the water. My junior engineer, Alex, was frantically messaging me. “The nodes are stuck in `NotReady`! I deployed our new external cloud controller, but nothing is happening!” A quick `kubectl describe node prod-worker-01` told me everything I needed to know. I saw that ugly, familiar taint: `node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule`. Alex had fallen into one of the most common, and most frustrating, traps in Kubernetes architecture: assuming the main controller manager knows what to do with your fancy new external one. It doesn’t.
The “Why”: Two Captains on One Ship
Let’s get this straight, because it’s the root of all the pain. The kube-controller-manager (KCM) does NOT call your external Cloud Controller Manager (CCM). Ever. Thinking it does is the core misunderstanding.
Here’s what’s really happening:
- The default, built-in kube-controller-manager has code for major cloud providers (AWS, Azure, GCP) baked right in. We call this the “in-tree” provider.
- When your cluster starts, the KCM looks at a node and says, “Aha! I need to initialize this node, assign it a ProviderID, and manage its lifecycle events.” It then tries to do this using its in-tree code.
- You, the savvy engineer, deploy an external CCM to handle your specific cloud (like vSphere, OpenStack, or in our case, a custom bare-metal provider). This is the “out-of-tree” provider.
The problem is, you now have two different controllers trying to do the same job. The KCM gets there first, sees an uninitialized node, and slaps that `uninitialized` taint on it, waiting for its *own* internal cloud logic to run. But you haven’t configured any in-tree provider, so it just waits forever. Your external CCM is running, but it’s respectfully waiting for that taint to be removed before it takes over, a taint that the KCM will never remove. You’ve created a deadlock.
The fix is to tell the kube-controller-manager to back off completely and let the external CCM do its job.
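Before touching any flags, it helps to confirm which nodes are actually stuck. Here's a minimal sketch, assuming `kubectl` is already pointed at the affected cluster. The function names are my own invention, not part of any tool. It asks for each node's taint keys and keeps only the nodes that still carry the `uninitialized` taint.

```shell
# Hypothetical helper: reads "name<TAB>taint keys" lines on stdin and
# prints the names of nodes that still carry the uninitialized taint.
filter_uninitialized() {
  awk -F'\t' 'index($2, "node.cloudprovider.kubernetes.io/uninitialized") { print $1 }'
}

# Hypothetical wrapper: asks the API server for every node's taint keys
# (space-separated after the tab) and filters them.
list_uninitialized_nodes() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[*].key}{"\n"}{end}' \
    | filter_uninitialized
}
```

Running `list_uninitialized_nodes` before and after a fix gives you a quick way to see whether the list is shrinking.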
The Fixes: From Duct Tape to a Proper Weld
I’ve seen this fixed a few ways, depending on the urgency and how the cluster was built. Here are my go-to methods.
Fix #1: The “It’s 3 AM and I Need Sleep” Quick Fix
This is the emergency, get-the-lights-back-on approach. We’re going to directly edit the static pod manifest for the kube-controller-manager on the master nodes to tell it to get out of the cloud business.
Steps:
- SSH into your primary control-plane node (e.g., `prod-k8s-master-01`).
- Find the static pod manifest for the KCM. It’s usually in `/etc/kubernetes/manifests/kube-controller-manager.yaml`.
- Add one crucial flag to the command arguments: `--cloud-provider=external`.
# Edit this file: /etc/kubernetes/manifests/kube-controller-manager.yaml
...
spec:
  containers:
  - command:
    - kube-controller-manager
    # --authentication-kubeconfig=... (existing flags)
    # --authorization-kubeconfig=... (existing flags)
    # ... more flags ...
    - --cloud-provider=external # <-- ADD THIS LINE!
    # ... more flags ...
...
Once you save the file, the kubelet will detect the change and restart the kube-controller-manager pod. Within a minute or two, you should see your external CCM spring to life, initialize the nodes, and remove the taint.
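If you want a quick sanity check that the edit landed as an active argument and not inside a comment, something like the following could work. This is a sketch under assumptions: the function names are hypothetical, the default path is the usual kubeadm location mentioned above, and it only recognizes the unquoted `- --cloud-provider=external` form.

```shell
# Hypothetical helper: succeeds if the manifest has an active
# (uncommented) "- --cloud-provider=external" argument line.
manifest_has_external_flag() {
  grep -Eq -e '^[[:space:]]*-[[:space:]]+--cloud-provider=external([[:space:]]|$)' "$1"
}

# Hypothetical wrapper that defaults to the usual kubeadm manifest path.
check_kcm_manifest() {
  manifest=${1:-/etc/kubernetes/manifests/kube-controller-manager.yaml}
  if manifest_has_external_flag "$manifest"; then
    echo "OK: $manifest sets --cloud-provider=external"
  else
    echo "MISSING: $manifest does not set --cloud-provider=external"
    return 1
  fi
}
```

Run `check_kcm_manifest` on each control-plane node. A commented-out line such as `# - --cloud-provider=external` is reported as missing.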
Warning: This is a hacky fix! If you use a tool that manages your cluster configuration (like kubeadm, kops, etc.), your next cluster upgrade will likely wipe out this manual change. Use this to stop the bleeding, then apply the permanent fix.
Fix #2: The “Do It Right” Permanent Fix
The correct way to solve this is to declare your intention to use an external cloud provider in your cluster’s core configuration. This way, the setting persists across upgrades and new node additions.
How you do this depends entirely on how you provisioned your cluster. Here’s an example for a cluster managed by kubeadm:
You need to modify your `ClusterConfiguration` object. If you’re setting up a new cluster, you’d put this in your init configuration file. If you’re fixing an existing one, you can use `kubectl edit cm -n kube-system kubeadm-config` and make the change there, then follow the kubeadm upgrade workflow.
# In your kubeadm-config.yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.28.2
controllerManager:
  extraArgs:
    cloud-provider: "external" # <-- This is the key
---
# You also need to tell the kubelets on ALL nodes
# to use an external provider. KubeletConfiguration has no
# cloud provider field, so pass it as a kubelet flag via
# nodeRegistration (use the same block in JoinConfiguration
# for nodes that join later).
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
nodeRegistration:
  kubeletExtraArgs:
    cloud-provider: "external" # <-- This is also critical
By setting this in your declarative configuration, you’ve made your intent clear. The control plane will always come up with the right flags, and you won’t get paged at 2 AM again for this specific problem.
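The kubelet half of this is easy to forget, so here's a hedged sketch for checking it on a node. It assumes a kubeadm-provisioned node, where the kubelet flags generated by kubeadm are normally written to `/var/lib/kubelet/kubeadm-flags.env`. Other provisioners keep kubelet flags elsewhere, and the function names are hypothetical.

```shell
# Hypothetical helper: succeeds if a kubelet flags file passes
# --cloud-provider=external (and not, say, --cloud-provider=externalfoo).
kubelet_flags_external() {
  grep -Eq -e '--cloud-provider=external([[:space:]"]|$)' "$1"
}

# Hypothetical wrapper; the default path is an assumption that
# holds for kubeadm-provisioned nodes.
check_kubelet_flags() {
  flags_file=${1:-/var/lib/kubelet/kubeadm-flags.env}
  if kubelet_flags_external "$flags_file"; then
    echo "OK: kubelet uses the external cloud provider"
  else
    echo "MISSING: add --cloud-provider=external to the kubelet flags"
    return 1
  fi
}
```

Run `check_kubelet_flags` on every node, workers included. One node without the flag is enough to cause confusing behavior later.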
| Provisioner | Typical Configuration Location |
| --- | --- |
| Kubeadm | `ClusterConfiguration` ConfigMap |
| Kubespray | Inventory variables (e.g., `cloud_provider: external`) |
| Kops | Cluster Spec (`kops edit cluster`) |
| RKE/Rancher | `cluster.yml` or Cluster Manager UI options |
Fix #3: The “Last Resort” Nuclear Option
Sometimes, things get really messy. Maybe the KCM ran for a while and incorrectly assigned `providerID`s to your nodes before you shut it down. Now your external CCM is confused and can’t reconcile the state. In this case, you need to wipe the slate clean.
Pro Tip: This is highly disruptive. Do this during a maintenance window. You are essentially re-registering your nodes.
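To decide which nodes really need this treatment, you can list the nodes whose `providerID` doesn't look like one your external CCM would assign. A rough sketch follows. The `mycloud://` prefix in the usage note is a made-up placeholder for whatever scheme your CCM uses, and the function names are hypothetical.

```shell
# Hypothetical helper: reads "name<TAB>providerID" lines on stdin and prints
# nodes whose providerID is set but does not start with the expected prefix.
filter_foreign_provider_ids() {
  awk -F'\t' -v prefix="$1" '$2 != "" && index($2, prefix) != 1 { print $1 "\t" $2 }'
}

# Hypothetical wrapper: pulls name and providerID for every node.
list_foreign_provider_ids() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.providerID}{"\n"}{end}' \
    | filter_foreign_provider_ids "$1"
}
```

For example, `list_foreign_provider_ids "mycloud://"` would print only the nodes that carry a `providerID` from some other scheme. Nodes with no `providerID` at all are skipped, since there is nothing to clean up on them.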
Steps:
- Implement Fix #2! Make sure the KCM is permanently configured with `--cloud-provider=external` before you do anything else.
- Cordon and drain a worker node you want to fix: `kubectl drain prod-worker-01 --ignore-daemonsets --delete-emptydir-data`.
- Delete the node object from the Kubernetes API: `kubectl delete node prod-worker-01`.
- SSH into that worker node (`prod-worker-01`).
- Restart the kubelet service: `sudo systemctl restart kubelet`.
When the kubelet restarts, it will attempt to re-register itself with the API server. Since the KCM is now correctly configured to ignore it, your external CCM will be the *only* controller that sees this new, uninitialized node. It will then correctly take over, assign the right `providerID`, initialize it, and let it join the cluster in a `Ready` state. Repeat for all your nodes, one by one.
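If you have more than a couple of nodes to redo, the steps above can be wrapped in a small script. This is only a sketch: it assumes the node name is also a hostname you can SSH into with sudo rights, and the `run`/`reregister_node` helpers and the `DRY_RUN` switch are my own additions. It deliberately handles one node per call so you can check the result before moving on.

```shell
# Run a command, or just print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: drain, delete, and re-register a single node.
# Assumes the node name is also a hostname you can SSH to.
reregister_node() {
  node=$1
  run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data || return 1
  run kubectl delete node "$node" || return 1
  run ssh "$node" sudo systemctl restart kubelet || return 1
}
```

With `DRY_RUN=1` set, `reregister_node prod-worker-01` only prints the three commands it would run, which is a cheap way to review them before the maintenance window.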
It’s a pain, but it guarantees a clean state. We had to do this for two of the nodes in the ‘Phoenix’ cluster to get them back online. It taught Alex a valuable lesson about controller responsibility that you don’t learn from the docs. You learn it in the trenches.
🤖 Frequently Asked Questions
❓ Does the kube-controller-manager (KCM) automatically integrate with an external Cloud Controller Manager (CCM)?
No, the kube-controller-manager does not call or interact with external CCMs. This misunderstanding often leads to a deadlock where the KCM applies an `uninitialized` taint, waiting for its own in-tree logic, while the external CCM waits for the taint to be removed.
❓ What is the difference between in-tree and out-of-tree cloud providers in Kubernetes?
In-tree cloud providers have their logic baked directly into the kube-controller-manager for major clouds like AWS or GCP. Out-of-tree providers are external CCMs deployed separately to manage custom or specific cloud environments, requiring explicit configuration to disable the KCM’s in-tree cloud logic.
❓ What is a common implementation pitfall when deploying an external Cloud Controller Manager and how is it resolved?
A common pitfall is nodes getting stuck with the `node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule` taint because the kube-controller-manager is still attempting to use its in-tree cloud logic. This is resolved by configuring both the kube-controller-manager and kubelet with the `--cloud-provider=external` flag.