πŸš€ Executive Summary

TL;DR: Managing Flux CD in multi-cluster, multi-tenant environments often leads to tangled configurations and high-risk changes within a single GitOps repository. The recommended solution is to adopt a “Hub and Spoke” architecture, separating platform configurations from tenant applications into distinct Git repositories, or to automate Flux configuration with tools like Terraform for large-scale, self-service platforms.

🎯 Key Takeaways

  • Effective multi-cluster, multi-tenant Flux CD requires strict separation of concerns: Platform Config, Tenant Config, and Cluster-Specific Config, to prevent unintended consequences and reduce blast radius.
  • The “Hub and Spoke” model is a recommended architecture where a platform-owned ‘Hub’ repository manages core cluster services and defines GitRepository/Kustomization resources that point to separate, tenant-owned ‘Spoke’ repositories for applications, ensuring strong isolation.
  • For large-scale organizations, the “Platform Engineering Endgame” approach automates the creation of Flux GitRepository and Kustomization resources using tools like the FluxCD Terraform Provider or Crossplane, enabling self-service and reducing manual YAML management.

Flux CD: A Reference Architecture (Multi-Cluster, Multi-Tenant)

Struggling with a messy GitOps repo for your multi-cluster setup? Let’s untangle your Flux CD architecture with battle-tested patterns for managing multiple clusters and tenants without losing your sanity.

Beyond the Bootstrap: A Real-World Guide to Multi-Cluster, Multi-Tenant Flux CD

I remember a Tuesday that went sideways. A junior engineer made a seemingly innocent change to a shared `kustomization.yaml` file. The intent was to roll out a new version of our logging agent to the `dev` cluster. The result? It also rolled out to `prod-db-01` because of a poorly defined overlay, causing a brief but terrifying logging blackout during a peak traffic period. That’s when we knew our “simple” single-repo approach for Flux had become a liability. We were playing GitOps Jenga, and the tower was about to fall.

The “Why”: Why Your Simple Flux Repo Becomes a Mess

Look, the default Flux bootstrap is fantastic for getting a single cluster up and running. It’s clean and simple. But the moment you add a second cluster, a third environment, or a second team, that simplicity becomes a trap. You start cramming everything into one repository. Soon, you have a complex web of Kustomize overlays and environment-specific branches that nobody fully understands. The root cause is a failure to separate concerns. You’re mixing three distinct things:

  • Platform Config: The stuff that makes a cluster a cluster. Ingress controllers, cert-manager, monitoring stacks, security policies. This is *our* stuff, the platform team’s domain.
  • Tenant Config: The applications and services that your development teams actually deploy. The `billing-api`, the `auth-service`, etc.
  • Cluster-Specific Config: The glue. The specific domain names for `prod-us-east-1` vs `staging-eu-west-1`, the resource limits, the node selectors.

When these are all tangled together, any change can have unintended consequences. Our goal is to untangle them. Here are a few ways we’ve done it, from the quick fix to the long-term architectural play.


The Fixes: From Duct Tape to a New Foundation

Solution 1: The Monorepo Starter Pack

This is the most common starting point. You stick with a single Git repository for everything, but you enforce a strict directory structure using Kustomize overlays. It’s a step up from chaos, but it requires discipline.

The structure usually looks something like this:


/clusters
β”œβ”€β”€/base                # Common manifests for all clusters (e.g., flux-system)
β”‚
β”œβ”€β”€/prod-us-east-1       # Prod cluster specific entrypoint
β”‚  └── kustomization.yaml # Points to base and overlays
β”‚
└──/staging-eu-west-1    # Staging cluster specific entrypoint
   └── kustomization.yaml

/apps
β”œβ”€β”€/base
β”‚  β”œβ”€β”€/prometheus       # Base prometheus install
β”‚  └──/grafana
β”‚
└──/overlays
   β”œβ”€β”€/prod             # Prod-specific app configs (e.g., higher replica count)
   β”‚  β”œβ”€β”€ prometheus-rules.yaml
   β”‚  └── kustomization.yaml
   β”‚
   └──/staging          # Staging-specific app configs
      └── kustomization.yaml

/infrastructure
β”œβ”€β”€/cert-manager
β”œβ”€β”€/nginx-ingress
...

In this model, the Flux instance on `prod-us-east-1` is pointed at the `/clusters/prod-us-east-1` path. That Kustomization then pulls in the base manifests and applies the production-specific overlays. It works, but the blast radius for a bad commit is still the entire repository.
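To make that concrete, here is a rough sketch of what the `/clusters/prod-us-east-1/kustomization.yaml` entrypoint might contain for the layout above. The exact relative paths are an assumption based on the tree shown; adjust them to your repo.

```yaml
# clusters/prod-us-east-1/kustomization.yaml (illustrative paths)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../base                          # common manifests shared by all clusters
  - ../../infrastructure/cert-manager
  - ../../infrastructure/nginx-ingress
  - ../../apps/overlays/prod         # prod-specific app overlays
```

One file like this per cluster is the whole "entrypoint" — which is exactly why a bad edit to anything under `base` or `infrastructure` fans out to every cluster that references it.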

Warning: With a monorepo, a single mistake in a `base` Kustomization can still impact every single cluster. Your PR review process has to be absolutely airtight. This structure tightly couples the lifecycle of your infrastructure and your applications.

Solution 2: The “Hub and Spoke” (Separating Platform from People)

This is where we are today, and it’s the architecture I recommend for any serious multi-tenant setup. It’s based on the idea of having more than one Git repository source. We use one repo for the “Hub” (the platform) and separate repos for each “Spoke” (the tenants/teams).

Step 1: The Platform “Hub” Repo

This repo is owned by the DevOps/Platform team. It bootstraps the cluster and manages the core, shared services. Crucially, it also defines the `GitRepository` and `Kustomization` resources that tell Flux to watch the *other* team repos.


# platform-repo/clusters/prod-us-east-1/tenants/billing-team.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: billing-team-repo
  namespace: flux-system
spec:
  interval: 1m
  url: ssh://git@github.com/your-org/billing-team-apps
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: billing-team-apps
  namespace: flux-system
spec:
  interval: 5m
  sourceRef:
    kind: GitRepository
    name: billing-team-repo
  path: "./deploy/prod" # Look in this path in their repo
  prune: true
  targetNamespace: billing-prod # Deploy their stuff into THEIR namespace
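One hardening detail worth adding: as written, a tenant's Kustomization is applied with the Flux controller's own (cluster-admin) permissions. Flux supports impersonation via `spec.serviceAccountName`, and in Flux's multi-tenancy pattern the tenant Kustomization typically moves into the tenant's namespace, because the impersonated ServiceAccount must exist in the same namespace as the Kustomization object. A hedged variant (the `billing-deployer` ServiceAccount name is illustrative; the platform team would create it with namespace-scoped RBAC):

```yaml
# Variant of the tenant Kustomization with impersonation.
# The Kustomization lives in the tenant namespace so the referenced
# ServiceAccount can too; cross-namespace sourceRefs must be allowed.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: billing-team-apps
  namespace: billing-prod
spec:
  interval: 5m
  sourceRef:
    kind: GitRepository
    name: billing-team-repo
    namespace: flux-system
  path: "./deploy/prod"
  prune: true
  targetNamespace: billing-prod
  serviceAccountName: billing-deployer  # apply with this SA's RBAC, not cluster-admin
```

With this in place, even a malicious manifest in the tenant repo can only do what `billing-deployer`'s RBAC allows.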

Step 2: The Tenant “Spoke” Repo

Each development team gets their own simple Git repo. They don’t need to know about the cluster’s infrastructure; they just manage their application manifests in their own space. The platform team has given them a `billing-prod` namespace to deploy into, and that’s all they need to know.


# billing-team-apps-repo/
/deploy
β”œβ”€β”€/prod
β”‚  β”œβ”€β”€ deployment.yaml
β”‚  β”œβ”€β”€ service.yaml
β”‚  └── kustomization.yaml
β”‚
└──/staging
   β”œβ”€β”€ deployment.yaml
   β”œβ”€β”€ service.yaml
   └── kustomization.yaml
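Each environment directory is a plain Kustomize entrypoint. A minimal `deploy/prod/kustomization.yaml` for this layout need be nothing more than:

```yaml
# billing-team-apps-repo/deploy/prod/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml
```

That's the entire interface the tenant team has to learn.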

This model gives you fantastic separation. The billing team can’t accidentally break the monitoring stack, and the platform team can’t break the billing app. It scales beautifully because onboarding a new team just means adding a new `GitRepository` source in the platform repo.

Solution 3: The Platform Engineering Endgame

At a certain scale (think dozens of clusters, hundreds of microservices), even managing the “Hub” repo’s YAML files becomes a chore. This is where you stop thinking about managing YAML and start thinking about building a platform. This approach uses other tools to manage Flux’s configuration itself.

Instead of manually writing the `GitRepository` and `Kustomization` YAMLs from Solution 2, you automate their creation. For example:

  • Terraform: Use the official FluxCD Terraform Provider. When you provision a new cluster with Terraform, you can also programmatically create the Flux resources that onboard the standard set of tenant applications.
  • Crossplane: You can go a step further and create your own “Tenant” Custom Resource Definition (CRD) with Crossplane. When a developer creates a `Tenant` object (e.g., `kubectl apply -f my-new-app.yaml`), a Crossplane composition can automatically create a new Git repo, a namespace, and the corresponding Flux `GitRepository` and `Kustomization` resources.
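As a sketch of the Crossplane route, the developer-facing object might look like the claim below. To be clear, everything here is hypothetical: the API group `platform.example.com`, the `Tenant` kind, and its fields are things you would define yourself in a Crossplane CompositeResourceDefinition, not a stock Crossplane API.

```yaml
# Hypothetical developer-facing claim; group, kind, and fields are
# examples you would define in your own Crossplane XRD.
apiVersion: platform.example.com/v1alpha1
kind: Tenant
metadata:
  name: billing-team
spec:
  team: billing
  repo: github.com/your-org/billing-team-apps
  environments:
    - prod
    - staging
```

Behind a claim like this, a Composition would stamp out the namespace plus the `GitRepository` and `Kustomization` objects from Solution 2, so onboarding a team becomes one `kubectl apply`.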

This is the “nuclear” option in terms of complexity, but for a large organization, it turns DevOps from a ticket-based system into a true self-service platform. It’s not just GitOps; it’s GitOps-driven automation.

Pro Tip: This approach is powerful but adds another layer of abstraction. Don’t jump here unless the pain of managing your “Hub” repo (Solution 2) is significant. Always solve the problem you have today, not the one you might have in five years.

Which Path to Choose?

To make it easier, here’s a quick breakdown of the trade-offs.

| Approach | Complexity | Isolation | Best For |
| --- | --- | --- | --- |
| 1. Monorepo Starter | Low | Low | Small teams, 1-3 clusters, low-risk environments. |
| 2. Hub & Spoke | Medium | High | Growing teams, multi-tenant production environments. The sweet spot for most. |
| 3. Platform Endgame | High | Very High | Large enterprises with a dedicated platform team aiming for self-service. |

There is no single “right” answer, only a series of trade-offs. We started with the monorepo, felt the pain, and migrated to the Hub and Spoke model. It’s given us the safety and scalability we needed. Start with the simplest thing that can possibly work, but have a plan for what you’ll do when it doesn’t. Your future self on that 3 AM call will thank you.

Darian Vance

Lead Cloud Architect & DevOps Strategist

With over 12 years in system architecture and automation, Darian specializes in simplifying complex cloud infrastructures. An advocate for open-source solutions, he founded TechResolve to provide engineers with actionable, battle-tested troubleshooting guides and robust software alternatives.


πŸ€– Frequently Asked Questions

❓ What are the primary challenges of using Flux CD in a multi-cluster, multi-tenant environment?

The primary challenges are a tangled single GitOps repository and the difficulty of keeping platform, tenant, and cluster-specific configuration separate. When those concerns are mixed, changes have unintended consequences, the blast radius of a bad commit grows, and the setup stops scaling as clusters and teams multiply.

❓ How does the ‘Hub and Spoke’ Flux CD model compare to a single monorepo approach?

The ‘Hub and Spoke’ model offers significantly higher isolation and scalability by separating platform and tenant configurations into distinct Git repositories, reducing the blast radius of changes. A monorepo, while simpler initially, tightly couples infrastructure and application lifecycles, increasing complexity and risk in multi-tenant setups.

❓ What is a common implementation pitfall when starting with a Flux CD monorepo for multiple clusters, and how can it be mitigated?

A common pitfall is that a single mistake in a ‘base’ Kustomization within a monorepo can impact every cluster due to tight coupling. This can be mitigated by enforcing an absolutely airtight PR review process and, for growing complexity, migrating to a ‘Hub and Spoke’ architecture for better separation of concerns.
