🚀 Executive Summary
TL;DR: Managing Flux CD in multi-cluster, multi-tenant environments often leads to tangled configurations and high-risk changes within a single GitOps repository. The recommended solution is to adopt a “Hub and Spoke” architecture, separating platform configurations from tenant applications into distinct Git repositories, or to automate Flux configuration with tools like Terraform for large-scale, self-service platforms.
🎯 Key Takeaways
- Effective multi-cluster, multi-tenant Flux CD requires strict separation of concerns (Platform Config, Tenant Config, and Cluster-Specific Config) to prevent unintended consequences and reduce blast radius.
- The “Hub and Spoke” model is a recommended architecture where a platform-owned ‘Hub’ repository manages core cluster services and defines GitRepository/Kustomization resources that point to separate, tenant-owned ‘Spoke’ repositories for applications, ensuring strong isolation.
- For large-scale organizations, the “Platform Engineering Endgame” approach automates the creation of Flux GitRepository and Kustomization resources using tools like the FluxCD Terraform Provider or Crossplane, enabling self-service and reducing manual YAML management.
Struggling with a messy GitOps repo for your multi-cluster setup? Let’s untangle your Flux CD architecture with battle-tested patterns for managing multiple clusters and tenants without losing your sanity.
Beyond the Bootstrap: A Real-World Guide to Multi-Cluster, Multi-Tenant Flux CD
I remember a Tuesday that went sideways. A junior engineer made a seemingly innocent change to a shared `kustomization.yaml` file. The intent was to roll out a new version of our logging agent to the `dev` cluster. The result? It also rolled out to `prod-db-01` because of a poorly defined overlay, causing a brief but terrifying logging blackout during a peak traffic period. That's when we knew our “simple” single-repo approach for Flux had become a liability. We were playing GitOps Jenga, and the tower was about to fall.
The “Why”: Why Your Simple Flux Repo Becomes a Mess
Look, the default Flux bootstrap is fantastic for getting a single cluster up and running. It’s clean and simple. But the moment you add a second cluster, a third environment, or a second team, that simplicity becomes a trap. You start cramming everything into one repository. Soon, you have a complex web of Kustomize overlays and environment-specific branches that nobody fully understands. The root cause is a failure to separate concerns. You’re mixing three distinct things:
- Platform Config: The stuff that makes a cluster a cluster. Ingress controllers, cert-manager, monitoring stacks, security policies. This is *our* stuff, the platform team’s domain.
- Tenant Config: The applications and services that your development teams actually deploy. The `billing-api`, the `auth-service`, etc.
- Cluster-Specific Config: The glue. The specific domain names for `prod-us-east-1` vs `staging-eu-west-1`, the resource limits, the node selectors.
When these are all tangled together, any change can have unintended consequences. Our goal is to untangle them. Here are a few ways we’ve done it, from the quick fix to the long-term architectural play.
The Fixes: From Duct Tape to a New Foundation
Solution 1: The Monorepo Starter Pack
This is the most common starting point. You stick with a single Git repository for everything, but you enforce a strict directory structure using Kustomize overlays. It’s a step up from chaos, but it requires discipline.
The structure usually looks something like this:
/clusters
├── /base                   # Common manifests for all clusters (e.g., flux-system)
│
├── /prod-us-east-1         # Prod cluster specific entrypoint
│   └── kustomization.yaml  # Points to base and overlays
│
└── /staging-eu-west-1      # Staging cluster specific entrypoint
    └── kustomization.yaml
/apps
├── /base
│   ├── /prometheus         # Base prometheus install
│   └── /grafana
│
└── /overlays
    ├── /prod               # Prod-specific app configs (e.g., higher replica count)
    │   ├── prometheus-rules.yaml
    │   └── kustomization.yaml
    │
    └── /staging            # Staging-specific app configs
        └── kustomization.yaml
/infrastructure
├── /cert-manager
└── /nginx-ingress
...
In this model, the Flux instance on `prod-us-east-1` is pointed at the `/clusters/prod-us-east-1` path. That Kustomization then pulls in the base manifests and applies the production-specific overlays. It works, but the blast radius for a bad commit is still the entire repository.
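To make that concrete, here is a minimal sketch of what the `prod-us-east-1` entrypoint might look like, assuming the directory layout above (the relative paths are illustrative and depend on your actual tree):

```yaml
# clusters/prod-us-east-1/kustomization.yaml (illustrative)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../base                        # shared manifests, e.g. flux-system
  - ../../apps/overlays/prod       # prod-specific app overlays
  - ../../infrastructure/cert-manager
  - ../../infrastructure/nginx-ingress
```

The key point is that the cluster entrypoint only *composes* paths; all the environment-specific patching lives in the overlays it references.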
Warning: With a monorepo, a single mistake in a `base` Kustomization can still impact every single cluster. Your PR review process has to be absolutely airtight. This structure tightly couples the lifecycle of your infrastructure and your applications.
Solution 2: The “Hub and Spoke” (Separating Platform from People)
This is where we are today, and it's the architecture I recommend for any serious multi-tenant setup. It’s based on the idea of having more than one Git repository source. We use one repo for the “Hub” (the platform) and separate repos for each “Spoke” (the tenants/teams).
Step 1: The Platform “Hub” Repo
This repo is owned by the DevOps/Platform team. It bootstraps the cluster and manages the core, shared services. Crucially, it also defines the `GitRepository` and `Kustomization` resources that tell Flux to watch the *other* team repos.
# platform-repo/clusters/prod-us-east-1/tenants/billing-team.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: billing-team-repo
  namespace: flux-system
spec:
  interval: 1m
  url: ssh://git@github.com/your-org/billing-team-apps
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: billing-team-apps
  namespace: flux-system
spec:
  interval: 5m
  sourceRef:
    kind: GitRepository
    name: billing-team-repo
  path: "./deploy/prod"         # Look in this path in their repo
  prune: true
  targetNamespace: billing-prod # Deploy their stuff into THEIR namespace
Step 2: The Tenant “Spoke” Repo
Each development team gets their own simple Git repo. They don’t need to know about the cluster’s infrastructure; they just manage their application manifests in their own space. The platform team has given them a `billing-prod` namespace to deploy into, and that’s all they need to know.
# billing-team-apps-repo/
/deploy
├── /prod
│   ├── deployment.yaml
│   ├── service.yaml
│   └── kustomization.yaml
│
└── /staging
    ├── deployment.yaml
    ├── service.yaml
    └── kustomization.yaml
This model gives you fantastic separation. The billing team can’t accidentally break the monitoring stack, and the platform team can’t break the billing app. It scales beautifully because onboarding a new team just means adding a new `GitRepository` source in the platform repo.
Solution 3: The Platform Engineering Endgame
At a certain scale (think dozens of clusters, hundreds of microservices), even managing the “Hub” repo’s YAML files becomes a chore. This is where you stop thinking about managing YAML and start thinking about building a platform. This approach uses other tools to manage Flux’s configuration itself.
Instead of manually writing the `GitRepository` and `Kustomization` YAMLs from Solution 2, you automate their creation. For example:
- Terraform: Use the official FluxCD Terraform Provider. When you provision a new cluster with Terraform, you can also programmatically create the Flux resources that onboard the standard set of tenant applications.
- Crossplane: You can go a step further and create your own “Tenant” Custom Resource Definition (CRD) with Crossplane. When a developer creates a `Tenant` object (e.g., `kubectl apply -f my-new-app.yaml`), a Crossplane composition can automatically create a new Git repo, a namespace, and the corresponding Flux `GitRepository` and `Kustomization` resources.
This is the “nuclear” option in terms of complexity, but for a large organization, it turns DevOps from a ticket-based system into a true self-service platform. It’s not just GitOps; it’s GitOps-driven automation.
Pro Tip: This approach is powerful but adds another layer of abstraction. Don’t jump here unless the pain of managing your “Hub” repo (Solution 2) is significant. Always solve the problem you have today, not the one you might have in five years.
Which Path to Choose?
To make it easier, here’s a quick breakdown of the trade-offs.
| Approach | Complexity | Isolation | Best For |
|---|---|---|---|
| 1. Monorepo Starter | Low | Low | Small teams, 1-3 clusters, low-risk environments. |
| 2. Hub & Spoke | Medium | High | Growing teams, multi-tenant production environments. The sweet spot for most. |
| 3. Platform Endgame | High | Very High | Large enterprises with a dedicated platform team aiming for self-service. |
There is no single “right” answer, only a series of trade-offs. We started with the monorepo, felt the pain, and migrated to the Hub and Spoke model. It’s given us the safety and scalability we needed. Start with the simplest thing that can possibly work, but have a plan for what you’ll do when it doesn’t. Your future self on that 3 AM call will thank you.
🤔 Frequently Asked Questions
❓ What are the primary challenges of using Flux CD in a multi-cluster, multi-tenant environment?
The primary challenges are a tangled single GitOps repository and the difficulty of separating platform, tenant, and cluster-specific configuration, which leads to unintended consequences, a large blast radius for changes, and problems scaling to more clusters and teams.
❓ How does the ‘Hub and Spoke’ Flux CD model compare to a single monorepo approach?
The ‘Hub and Spoke’ model offers significantly higher isolation and scalability by separating platform and tenant configurations into distinct Git repositories, reducing the blast radius of changes. A monorepo, while simpler initially, tightly couples infrastructure and application lifecycles, increasing complexity and risk in multi-tenant setups.
❓ What is a common implementation pitfall when starting with a Flux CD monorepo for multiple clusters, and how can it be mitigated?
A common pitfall is that a single mistake in a ‘base’ Kustomization within a monorepo can impact every cluster due to tight coupling. This can be mitigated by enforcing an absolutely airtight PR review process and, for growing complexity, migrating to a ‘Hub and Spoke’ architecture for better separation of concerns.