🚀 Executive Summary
TL;DR: Fragmented Terraform/OpenTofu repositories across multiple teams lead to critical issues like configuration drift and dangerous outages due to uncoordinated changes. The article proposes a tiered approach to solve this, ranging from a quick centralized state output catalog to a comprehensive platform abstraction layer, to regain control and scale IaC effectively.
🎯 Key Takeaways
- A Centralized State Output Catalog acts as a quick, cultural fix to establish IaC ‘contracts’ by documenting official `remote_state` configurations and outputs, preventing ad-hoc state consumption.
- Implementing a Module Monorepo with a Private Registry provides a permanent solution by centralizing reusable, company-approved Terraform/OpenTofu modules, enforcing standards, and significantly reducing boilerplate and security review overhead.
- The Platform Abstraction Layer, or Internal Developer Platform (IDP), is the ultimate solution for hyper-scale, enabling application teams to define infrastructure needs via simplified manifests, abstracting away direct Terraform interaction and maximizing developer velocity.
Struggling with fragmented Terraform or OpenTofu repos across teams? This guide explores the root causes and offers three practical solutions, from quick fixes to long-term architectural changes, to help you regain control of your IaC at scale.
IaC at Scale: Taming the Hydra of Fragmented Terraform Repos
I remember a 2 AM PagerDuty alert. The core authentication service was down, and our status page was bleeding red. The cause? A seemingly unrelated change to a networking module by the Analytics team. Their Terraform repo had a stale remote_state data source pointing to an old S3 bucket for our VPC config. Their ‘apply’ reverted a critical security group rule we had patched an hour earlier. That’s when I knew our “let a thousand flowers bloom” approach to IaC was actually just letting a thousand weeds choke the garden. It’s a story I’ve seen play out at nearly every company I’ve worked for, and it’s a direct result of well-intentioned autonomy colliding with scale.
Why Does This Fragmentation Even Happen?
Let’s be honest, it rarely starts with bad intentions. It’s the natural outcome of Conway’s Law: your systems architecture will mirror your organization’s communication structure. You have multiple teams, so you get multiple repos. Each team wants to move fast, so they create their own little IaC kingdom. Initially, this feels like agility. The ‘Platform’ team builds the VPC, the ‘Apps’ team builds the EC2s, and the ‘Data’ team builds the Redshift cluster.
The problem creeps in silently. It starts with duplication: Team A needs a security group and so does Team B, so each writes its own module, slightly different from the other. Then comes drift. Soon, nobody knows which repo is the source of truth for what. Finding the owner of a resource like `prod-db-security-group-alpha` requires a frantic search through a dozen different repositories. This isn’t just inefficient; as my war story shows, it’s downright dangerous.
Three Paths to Reclaiming Sanity
You can’t boil the ocean, but you can start treating the wounds. I’ve seen three approaches work in the real world, each with its own trade-offs. We’ll go from the quick band-aid to major surgery.
1. The Quick Fix: A Centralized State Output Catalog
This is the “stop the bleeding now” approach. The immediate problem is teams randomly grabbing state from each other without a clear contract. The fix is to create one. You establish a central, well-documented place (a Git repo with markdown files, a Confluence page, whatever) that becomes the single source of truth for shareable outputs.
A team that owns a core component (like networking) is responsible for publishing the exact `remote_state` configuration and available outputs to this catalog. Other teams are then mandated to pull from this catalog, not from memory or an old Slack message.
What NOT to do:
# in the app-team repo... a total guess.
data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "some-random-tf-state-bucket-we-think" # Is this right? Who knows.
    key    = "networking/terraform.tfstate"
    region = "us-east-1"
  }
}

# Is it vpc_id or vpc_id_output? Let's try both!
resource "aws_instance" "app_server" {
  vpc_security_group_ids = [data.terraform_remote_state.network.outputs.web_sg_id]
  subnet_id              = data.terraform_remote_state.network.outputs.private_subnet_a # Did this name change?
  # ...
}
The better, catalog-driven approach:
The developer now goes to the “State Catalog” and finds the official, versioned snippet for the Production VPC:
# Pulled from the central catalog - DO NOT MODIFY
data "terraform_remote_state" "network_prod_v1" {
  backend = "s3"
  config = {
    bucket = "tf-state-prod-us-east-1-core"
    key    = "global/network/terraform.tfstate"
    region = "us-east-1"
  }
}

# The catalog guarantees these output names are stable
resource "aws_instance" "app_server" {
  vpc_security_group_ids = [data.terraform_remote_state.network_prod_v1.outputs.default_app_sg_id]
  subnet_id              = data.terraform_remote_state.network_prod_v1.outputs.private_app_subnets[0]
  # ...
}
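The consumer snippet only works if the networking team actually exports those names. A minimal sketch of the producer side follows. The resource names, CIDR range, and subnet count are assumptions for illustration; only the two output names come from the catalog snippet above.

```hcl
# Networking team's root module (hypothetical resources).
# Only the outputs at the bottom are part of the published contract.

resource "aws_vpc" "core" {
  cidr_block = "10.0.0.0/16" # assumed range
}

resource "aws_subnet" "private_app" {
  count      = 2 # assumed number of private app subnets
  vpc_id     = aws_vpc.core.id
  cidr_block = cidrsubnet(aws_vpc.core.cidr_block, 8, count.index)
}

resource "aws_security_group" "default_app" {
  name   = "default-app"
  vpc_id = aws_vpc.core.id
}

# Contract: these names are what the catalog documents for consumers.
output "default_app_sg_id" {
  description = "Security group ID for standard application instances."
  value       = aws_security_group.default_app.id
}

output "private_app_subnets" {
  description = "IDs of the private application subnets."
  value       = aws_subnet.private_app[*].id
}
```

Renaming or removing one of these outputs is a breaking change for every consumer, which is one reason to put a version in the catalog entry's name (as in `network_prod_v1`).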
Pro Tip: This is a cultural fix as much as a technical one. It’s hacky, yes, but it forces conversations between teams and establishes the idea of IaC “contracts.” It’s the first step away from the wild west.
2. The Permanent Fix: A Module Monorepo & Private Registry
The next level of maturity is to stop teams from reinventing the wheel. If three different teams need an S3 bucket with encryption and logging, there should be one blessed module for that, not three copy-pasted versions.
This is where a monorepo for IaC modules, combined with a private module registry, shines.
- The Monorepo: A single Git repository holds all the common, reusable, and company-approved Terraform/Tofu modules (e.g., for creating an RDS instance, a security group, an S3 bucket). This repo has strict PR processes, linting, and automated testing.
- The Private Registry: You publish versioned artifacts from the monorepo to a registry. This could be Terraform Cloud/Enterprise, Artifactory, or, at the simplest, no registry at all: Git tags referenced directly as a module source.
Now, application teams don’t write complex resources. They consume simple, standardized modules.
# In the app-team's repo - clean, simple, and standardized.
module "my_app_db" {
  source  = "app.terraform.io/techresolve/rds-postgres/aws"
  version = "~> 2.1"

  instance_class    = "db.t3.medium"
  allocated_storage = 50
  db_name           = "my_app_prod"
  environment       = "prod"
  # Encryption, logging, backups, tagging are all handled by the module!
}

module "my_app_bucket" {
  source  = "app.terraform.io/techresolve/s3-private/aws"
  version = "~> 1.4"

  bucket_name = "techresolve-app-prod-backups"
  # Lifecycle policies and IAM controls are baked in.
}
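If you start with Git tags rather than a registry, the call changes slightly. The sketch below uses a hypothetical repository URL, module path, and tag name. The `version` argument only applies to registry sources, so with Git the pin lives in the `ref` query parameter instead.

```hcl
module "my_app_bucket" {
  # Hypothetical monorepo location.
  # "//" selects a subdirectory inside the repo; "?ref=" pins a tag.
  source = "git::https://git.example.com/platform/terraform-modules.git//modules/s3-private?ref=s3-private-v1.4.0"

  bucket_name = "techresolve-app-prod-backups"
}
```

The trade-off is that a Git ref is an exact pin: you lose range constraints like `~> 1.4`, so upgrades mean editing the tag in each consuming repo.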
This enforces standards, reduces boilerplate, and makes security reviews far easier. You’re no longer reviewing 200 lines of raw AWS resources; you’re reviewing 10 lines that call a trusted, pre-vetted module.
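To make "enforces standards" concrete, here is a rough sketch of what the inside of such a module could look like. This is not the actual `rds-postgres` module; the username, retention period, and allowed environment names are assumptions. The point is that the safe settings are hard-coded, and callers get no variable to switch them off.

```hcl
# Inside a hypothetical rds-postgres module (variables and main condensed).

variable "environment" {
  type        = string
  description = "Deployment environment; used for tagging."

  validation {
    # Assumed list of allowed environments.
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "environment must be one of: dev, staging, prod."
  }
}

variable "instance_class" {
  type = string
}

variable "allocated_storage" {
  type = number
}

variable "db_name" {
  type = string
}

resource "aws_db_instance" "this" {
  engine            = "postgres"
  instance_class    = var.instance_class
  allocated_storage = var.allocated_storage
  db_name           = var.db_name

  username                    = "app_admin" # assumed default
  manage_master_user_password = true        # RDS stores the password in Secrets Manager

  # Company standards: not exposed as variables.
  storage_encrypted               = true
  backup_retention_period         = 7 # assumed policy
  enabled_cloudwatch_logs_exports = ["postgresql"]
  skip_final_snapshot             = false
  # Snapshot identifiers allow hyphens but not underscores.
  final_snapshot_identifier       = "${replace(var.db_name, "_", "-")}-final"

  tags = {
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}
```

A reviewer who trusts this module only has to check the four inputs a team passes in, not the encryption or backup settings.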
3. The ‘Nuclear’ Option: The Platform Abstraction Layer
This is the endgame for many orgs at true hyper-scale. You accept that most developers don’t want to—and shouldn’t have to—become Terraform experts. You build a true Platform Engineering team that provides infrastructure as a service, not just as code.
In this model, the application teams may not interact with Terraform directly at all for 90% of their needs. Instead, they define their infrastructure needs in a simplified manifest file right inside their application’s repository.
A central CI/CD platform (e.g., GitLab, GitHub Actions) reads this manifest, validates it, and runs the master Terraform/Tofu configuration on the team’s behalf. The platform team owns the complex IaC, and the app teams own the simple definition.
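As a rough sketch of the platform-owned side, the configuration below decodes a manifest and fans it out into module calls. The manifest schema (`service`, `environment`, `databases`, `size`, `storage_gb`) and the size mapping are invented for illustration; the module source and its inputs are the ones from the registry example earlier. The YAML is inlined so the sketch is self-contained; a real pipeline would more likely read it with `file()` from the checked-out service repo.

```hcl
locals {
  # Stand-in for the contents of the team's app-manifest.yaml (hypothetical schema).
  manifest_yaml = <<-YAML
    service: my-app
    environment: prod
    databases:
      main:
        name: my_app_prod
        size: medium
        storage_gb: 50
  YAML

  manifest = yamldecode(local.manifest_yaml)

  # Platform-owned mapping from friendly sizes to instance classes (assumed values).
  db_sizes = {
    small  = "db.t3.small"
    medium = "db.t3.medium"
  }
}

module "db" {
  source  = "app.terraform.io/techresolve/rds-postgres/aws"
  version = "~> 2.1"

  # One module instance per database entry in the manifest.
  for_each = local.manifest.databases

  # An unknown size has no key in db_sizes, so the plan fails with an error
  # instead of provisioning something unexpected.
  instance_class    = local.db_sizes[each.value.size]
  allocated_storage = each.value.storage_gb
  db_name           = each.value.name
  environment       = local.manifest.environment
}
```

The app team only ever touches the YAML; the size mapping, module versions, and everything inside the module stay under the platform team's control.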
Let’s compare the developer’s experience:
| Old Way (Direct IaC) | New Way (Platform Abstraction) |
|---|---|
| Developer needs to understand Terraform syntax, set up backends, manage providers, and learn the company’s custom modules. They run `terraform plan`/`apply` themselves. | Developer edits a simple `app-manifest.yaml` file in their service repo. A central pipeline does the rest. |
Warning: This is a massive undertaking. You are effectively building an Internal Developer Platform (IDP). It requires a dedicated team and a significant investment. Don’t jump here from day one, but know that this is the direction to go when your goal is to maximize developer velocity and minimize infrastructure cognitive load.
Ultimately, there’s no single right answer. The key is to recognize when the pain of fragmentation outweighs the initial benefit of team autonomy. Start with the catalog, evolve to a module registry, and keep the platform dream in your back pocket for when you’re ready to truly scale.
🤖 Frequently Asked Questions
❓ What are the root causes of fragmented IaC repositories at scale?
Fragmentation often stems from Conway’s Law, where organizational structure dictates system architecture. Teams seeking autonomy create their own IaC ‘kingdoms,’ leading to duplication, configuration drift, and difficulty in identifying the source of truth for shared resources.
❓ How do the proposed solutions address the problem of ‘stale remote_state’ data sources?
The Centralized State Output Catalog directly addresses this by providing a single, documented source for `remote_state` configurations and guaranteed output names, preventing teams from guessing or using outdated references. The module monorepo and platform abstraction layer further reduce this risk by centralizing module consumption and abstracting infrastructure definitions.
❓ What is a common pitfall when adopting a module monorepo and private registry?
A common pitfall is neglecting to establish strict PR processes, linting, and automated testing for the monorepo. Without these controls, the private registry can become populated with inconsistent or low-quality modules, undermining the goal of standardization and reliability.