🚀 Executive Summary

TL;DR: GKE Autopilot clusters can deadlock when installing cert-manager due to its mutating webhook attempting to modify critical `kube-system` components, which Autopilot’s strict security model denies. The solution involves configuring cert-manager’s webhook via Helm to explicitly ignore the `kube-system` namespace, preventing this feature collision.

🎯 Key Takeaways

GKE Autopilot’s strict security model prevents modification of resources within critical namespaces like `kube-system`.
cert-manager’s mutating admission webhook, by default, intercepts and attempts to modify all pod creation requests, leading to a deadlock when it targets Autopilot’s internal components.
The recommended and most robust solution is to configure cert-manager’s Helm chart with a `namespaceSelector` to explicitly exclude `kube-system` from webhook validation, preventing the deadlock during installation.

Struggling to get cert-manager installed in a GKE Autopilot cluster

Struggling with cert-manager in a GKE Autopilot cluster? Understand the root cause of the webhook deadlock and discover three practical fixes, from quick post-install labels to robust, GitOps-friendly Helm configurations that prevent the problem entirely.

That Time cert-manager Bricked Our GKE Autopilot Cluster (And How We Fixed It)

It was 10 PM on a Tuesday. We were rolling out a minor update to our billing service. Everything looked green. Then, a junior engineer on my team, let’s call him Alex, decided to do some “quick housekeeping” and upgrade cert-manager using the latest Helm chart. Five minutes later, Slack exploded. Alerts started firing for services that hadn’t even been touched. I tried to `kubectl get pods` and the command just hung. The GKE console was a sea of red. We had a full-blown, self-inflicted outage, and it took us another hour to figure out that the combination of a “secure-by-default” Autopilot cluster and a standard cert-manager install had created a deadly embrace that was choking our control plane.

If you’re here, you’ve probably felt that same cold sweat. You followed the docs, you ran the install command, and now your cluster is unresponsive. Don’t panic. You’re not alone, and the fix is easier than you think. Let’s dig in.

The “Why”: A Tale of Two Webhooks

So what’s actually happening here? This isn’t a bug in cert-manager or GKE Autopilot; it’s a feature collision.

GKE Autopilot’s Security Posture: Autopilot is fantastic because it manages the nodes for you. Part of that “management” is a very strict security model. It heavily restricts what can run in, or modify resources within, critical namespaces like kube-system. This is a good thing—it prevents you from accidentally breaking your own control plane.
cert-manager’s Mutating Webhook: A key part of cert-manager is its mutating admission webhook. When you create a Pod that needs a certificate, this webhook intercepts the request to the Kubernetes API server and injects the necessary CA bundle information and volume mounts. This is how the magic happens automatically.

Here’s the deadlock: When Autopilot needs to schedule its own critical components in the kube-system namespace (like the konnectivity-agent pods that allow the control plane to talk to the nodes), the cert-manager webhook tries to intercept and mutate them. But Autopilot’s security model sees this and says, “Nope! You are not allowed to modify objects in this protected namespace.” The API server denies the request, the pod fails to start, and the control plane loses contact with the nodes. Game over.
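Before changing anything, it can help to see which namespaces the webhook is actually scoped to. The following is a minimal sketch rather than an official diagnostic: the helper name `show_webhook_scope` is made up for illustration, and `cert-manager-webhook` is assumed to be the configuration name, which is what a Helm release named `cert-manager` normally creates.

```shell
# Hypothetical helper: for both webhook configuration kinds, print each
# webhook's name, failurePolicy and namespaceSelector on one line.
# Assumption: the configurations are named "cert-manager-webhook";
# pass a different name as the first argument if your release differs.
show_webhook_scope() {
  local name="${1:-cert-manager-webhook}"
  local kind
  for kind in mutatingwebhookconfiguration validatingwebhookconfiguration; do
    echo "== $kind/$name"
    kubectl get "$kind" "$name" \
      -o jsonpath='{range .webhooks[*]}{.name}{"\t"}{.failurePolicy}{"\t"}{.namespaceSelector}{"\n"}{end}'
  done
}

# Usage: show_webhook_scope
```

An empty namespaceSelector in the output means the webhook applies to every namespace, kube-system included.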

The Fixes: From Emergency Patch to Architectural Solution

Okay, enough theory. You need to get your cluster back. We’ve got three ways to tackle this, ranging from the quick-and-dirty to the proper, long-term solution.

Fix #1: The ‘Get Me Out of Jail’ Fix

This is the fastest way to fix a cluster that is already broken. The goal is to tell the cert-manager webhook to flat-out ignore the kube-system namespace. We do this by applying a specific label to the namespace that the webhook’s default configuration understands.

kubectl label namespace kube-system cert-manager.io/disable-validation=true

This command tells cert-manager, “Hey, see this namespace? Don’t touch anything in it.” The webhook will stop intercepting requests for kube-system, allowing the critical Autopilot pods to schedule, and your cluster will come back to life within a few minutes.
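To confirm the label actually landed before you start waiting on pods, a quick check along these lines may be useful. It is only a sketch: the function name `kube_system_is_excluded` is hypothetical, and it relies on nothing more than `kubectl get namespace --show-labels`.

```shell
# Hypothetical helper: exit 0 only if kube-system carries the opt-out label.
kube_system_is_excluded() {
  kubectl get namespace kube-system --show-labels \
    | grep -qF 'cert-manager.io/disable-validation=true'
}

# Usage: only look at the system pods once the label is confirmed.
# kube_system_is_excluded && kubectl get pods -n kube-system
```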

My Take: This is a band-aid. It’s effective for an emergency, but it’s manual and easy to forget. If you rebuild the cluster or someone removes the label, you’re right back where you started. Use this to stop the bleeding, then implement Fix #2.

Fix #2: The ‘Do It Right the First Time’ Fix

This is the solution you should be using in your CI/CD pipelines and GitOps workflows. Instead of fixing the problem after installation, we’re going to tell Helm to configure cert-manager correctly from the very beginning.

You do this by overriding the default values in the Helm chart. Create a values file for your cert-manager installation (here called values-cert-manager-gke.yaml) and add the following configuration:

# values-cert-manager-gke.yaml
installCRDs: true

webhook:
  namespaceSelector:
    matchExpressions:
      - key: cert-manager.io/disable-validation
        operator: NotIn
        values:
        - "true"
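      # Kubernetes sets kubernetes.io/metadata.name on every namespace automatically;
      # a plain "name" label does not exist by default, so it would exclude nothing.
      - key: kubernetes.io/metadata.name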
        operator: NotIn
        values:
        - kube-system

Now, install or upgrade cert-manager using this file:

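# Assumption: the jetstack chart repo is not configured locally yet; skip these two lines if it is.
helm repo add jetstack https://charts.jetstack.io
helm repo update

helm upgrade --install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --version v1.14.4 \
  -f values-cert-manager-gke.yaml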

The two matchExpressions in this namespaceSelector are ANDed, so a namespace is sent to the webhook only if it passes both checks. In effect, the webhook ignores any namespace that has the cert-manager.io/disable-validation=true label or that is kube-system itself. It’s declarative, it’s idempotent, and it lives in your source control. This is how we do it at TechResolve.
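Value key names can differ between chart versions, and a key the chart does not read may be silently ignored or rejected. One way to check, sketched here under the assumption that the jetstack repo is already added, is to render the chart locally and look at the selector that comes out. The function name `render_namespace_selectors` is hypothetical.

```shell
# Hypothetical helper: render the chart with a values file (nothing is
# installed) and print every namespaceSelector block from the output.
render_namespace_selectors() {
  local values_file="$1"
  helm template cert-manager jetstack/cert-manager \
    --namespace cert-manager \
    --version v1.14.4 \
    -f "$values_file" \
    | grep -A 10 'namespaceSelector:'
}

# Usage: render_namespace_selectors values-cert-manager-gke.yaml
```

If the kube-system exclusion does not appear in the rendered webhook configuration, the values file is not taking effect for that chart version, and the chart's own values reference is the place to find the right key.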

Fix #3: The ‘Paranoid Architect’ Fix

Sometimes, you need a failsafe. What if another critical namespace gets added by Google in the future? The namespaceSelector is great, but it’s specific. An alternative approach is to change the webhook’s failurePolicy. By default, it’s set to Fail, which means if the webhook can’t process a request for any reason, the entire resource creation fails (causing our deadlock). We can change it to Ignore.

In your Helm values.yaml, you can configure it like this:

# values-cert-manager-gke-failsafe.yaml
installCRDs: true

webhook:
  failurePolicy: Ignore # Default is 'Fail'

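This tells the Kubernetes API server, “Try to send this request to the cert-manager webhook. But if the webhook is down, times out, or the call itself errors out, just ignore it and create the resource anyway.” Note that Ignore only covers failures to call the webhook. If the webhook responds and explicitly rejects the request, that rejection is still enforced.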

Warning: This is a double-edged sword. This will absolutely prevent the cluster-wide deadlock scenario. However, it can also mask other problems. If your webhook is misconfigured or crashing, this policy will cause certificate injection to fail silently. Your pod will start, but it won’t have its TLS certs. Use this with caution and ensure you have strong monitoring to detect when certificate injection fails.
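The monitoring mentioned above could start as something as simple as a periodic check for Certificate resources that are not Ready. This is a rough sketch, not a replacement for real alerting. It assumes cert-manager's CRDs are installed and that kubectl's table output for Certificates with --all-namespaces lists NAMESPACE, NAME and READY as the first three columns. The function name `list_unready_certificates` is hypothetical.

```shell
# Hypothetical helper: print "namespace/name" for every Certificate whose
# READY column is anything other than "True".
# Assumption: columns are NAMESPACE NAME READY SECRET AGE.
list_unready_certificates() {
  kubectl get certificates --all-namespaces --no-headers \
    | awk '$3 != "True" { print $1 "/" $2 }'
}

# Usage: list_unready_certificates
```

Empty output means every Certificate reported Ready at the time of the check.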

My Two Cents

In the trenches, things break. Autopilot’s managed security is a powerful asset, but it demands you understand its constraints. For us, the “aha!” moment wasn’t just fixing the cluster, it was realizing we need to treat addons like cert-manager as first-class applications with configurations tailored to the environment. Don’t just copy-paste the default install command. Read the chart’s values, understand what the webhooks are doing, and use the ‘Do It Right The First Time’ (Fix #2) approach. It’ll save you that 10 PM outage call.

Darian Vance

Lead Cloud Architect & DevOps Strategist

With over 12 years in system architecture and automation, Darian specializes in simplifying complex cloud infrastructures. An advocate for open-source solutions, he founded TechResolve to provide engineers with actionable, battle-tested troubleshooting guides and robust software alternatives.

🤖 Frequently Asked Questions

❓ What causes cert-manager to break GKE Autopilot clusters?

The cert-manager mutating webhook attempts to modify pods in the `kube-system` namespace, which GKE Autopilot’s strict security model denies, leading to a deadlock that prevents critical control plane components from scheduling.

❓ How does configuring `namespaceSelector` compare to changing `failurePolicy`?

Configuring `namespaceSelector` is a precise, declarative method to prevent the webhook from acting on specific namespaces like `kube-system`, ensuring security and predictable behavior. Changing `failurePolicy` to `Ignore` is a broader failsafe that prevents deadlocks but can mask silent failures in certificate injection, potentially leading to applications running without valid TLS.

❓ What is a common implementation pitfall when installing cert-manager in GKE Autopilot?

A common pitfall is using the default cert-manager Helm installation without specific configurations for Autopilot, leading to a cluster-wide deadlock. This is avoided by using a `values.yaml` file to configure the webhook’s `namespaceSelector` to explicitly exclude the `kube-system` namespace during installation.

