🚀 Executive Summary
TL;DR: The “Terralith” problem arises when configuring Cisco ACI from Netbox using Terraform, leading to excessively large and risky plans due to data sources greedily pulling too much information. Solutions involve surgically filtering data, decoupling data gathering from infrastructure deployment, or programmatically generating Terraform configurations to manage state drift and complex dependencies. This approach aims to make Terraform plans smaller, faster, and safer by guiding the tool to retrieve only necessary data.
🎯 Key Takeaways
- Surgical filtering using Netbox tags or custom fields with Terraform’s `filter` blocks dramatically reduces the scope of data sources, making plans smaller and safer.
- The Data Component Pattern decouples Netbox data export into a separate workspace that generates a static data file (e.g., `aci_model.json`), which a second ACI deployment workspace then consumes, severing live dependencies and speeding up plans.
- For complex brownfield environments, generating Terraform JSON configuration files (`.tf.json`) programmatically using a lower-level language (like Python with `pynetbox` and `acitoolkit`) allows for hyper-specific data fetching and transformation, bypassing provider limitations.
Tackle the dreaded ‘Terralith’ problem when configuring Cisco ACI from Netbox. Learn why Terraform plans spiral out of control and discover three practical, real-world solutions to manage state drift and complex dependencies.
Slaying the Terralith: A Senior Engineer’s Guide to Taming Cisco ACI and Netbox
It was 2 AM on a Tuesday. We had a ‘simple’ change window to add a new bridge domain to our production ACI fabric for the `payment-processing` app. I typed `terraform plan` and took a sip of lukewarm coffee. The plan started scrolling. And scrolling. And scrolling. My heart sank. When it finally finished, the terminal glowed with a terrifying summary: Plan: 350 to add, 1,200 to change, 45 to destroy. All for one bridge domain. We’d just met the “Terralith” – a monolithic, terrifyingly interconnected Terraform plan that turns a small change into a high-risk gamble. If you’re reading this, you’ve probably met it, too. Don’t worry, we’re going to slay this beast together.
So, What’s Actually Happening Here? The Root of the Terralith
This isn’t really a bug in Terraform, the Netbox provider, or the ACI provider. It’s a symptom of a glorious, large-scale automation problem. The root cause is simple: your data source is too greedy.
When you declare an unfiltered data source like `netbox_tenants`, the provider happily goes to the Netbox API and says, “Give me everything!” It pulls down every single tenant, and equally unfiltered data sources do the same for every VRF and every other object in your source of truth that you map to ACI. Terraform then builds a massive dependency graph of every possible object. When you try to create a single ACI tenant resource based on that data, Terraform has to evaluate the entire dataset to ensure its state is consistent. Any tiny drift or unintended dependency creates a cascade of proposed “changes” across your fabric.
The goal isn’t to stop using Netbox as a source of truth. The goal is to be smarter about *how* we ask for the data.
Three Ways to Slay the Beast
I’ve fought this battle on multiple fronts. Here are the three strategies that have worked for my teams, ranging from a quick fix to a full architectural shift.
Solution 1: The Quick Fix – Surgical Filtering with `filter` and `for_each`
This is your first and best line of defense. Instead of pulling all objects, you tell the provider you’re only interested in a specific subset. The best way to do this is by using a Netbox Tag or a Custom Field.
Let’s say we create a tag in Netbox called `aci-managed`. We apply this tag only to the tenants we want Terraform to control.
Your HCL goes from this (the “greedy” way):
# main.tf - THE BAD WAY
data "netbox_tenants" "all_tenants" {}

resource "aci_tenant" "generated_tenants" {
  for_each    = { for t in data.netbox_tenants.all_tenants.tenants : t.name => t }
  name        = each.value.name
  description = "Managed by Terraform"
}
To this (the “surgical” way):
# main.tf - THE BETTER WAY
data "netbox_tenants" "aci_managed_tenants" {
  filter {
    name  = "tag"
    value = "aci-managed"
  }
}

resource "aci_tenant" "generated_tenants" {
  # Now the loop only includes tenants with the 'aci-managed' tag
  for_each    = { for t in data.netbox_tenants.aci_managed_tenants.tenants : t.name => t }
  name        = each.value.name
  description = "Managed by Terraform"
}
This simple change reduces the scope of your data source dramatically. The plan now only considers the objects you’ve explicitly tagged, making it faster, smaller, and much safer.
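One thing to watch with a filtered data source: a typo in the tag name matches nothing, and an empty `for_each` reads to Terraform as "destroy everything I was managing." Here is a minimal sketch of a guard. It assumes Terraform 1.2 or later (for `postcondition` in data blocks) and assumes the provider returns an empty `tenants` list rather than an error when nothing matches. The output name is hypothetical, chosen only for illustration.

```hcl
# Same data source as above, with a guard added (assumes Terraform >= 1.2).
data "netbox_tenants" "aci_managed_tenants" {
  filter {
    name  = "tag"
    value = "aci-managed"
  }

  lifecycle {
    # Fail the plan if the filter matched nothing, instead of silently
    # handing an empty map to for_each.
    postcondition {
      condition     = length(self.tenants) > 0
      error_message = "No Netbox tenants are tagged aci-managed; refusing to plan against an empty set."
    }
  }
}

# Hypothetical output: shows exactly which tenants the filter selected,
# so the scope is visible in every plan.
output "aci_managed_tenant_names" {
  value = sort([for t in data.netbox_tenants.aci_managed_tenants.tenants : t.name])
}
```

Reading the output in the plan is a cheap way to confirm that the tag selects the tenants you expect before anything touches the fabric.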
Solution 2: The Permanent Fix – The Data Component Pattern
If your environment is large and complex, filtering alone might not be enough. The next level of maturity is to decouple your data gathering from your infrastructure deployment. We call this the “Data Component” pattern.
It works by breaking your Terraform configuration into two distinct workspaces that communicate via a simple data file, not direct provider dependencies.
| Component | Responsibility | Output |
| --- | --- | --- |
| Workspace 1: `netbox-data-export` | Reads all required data from Netbox (tenants, VRFs, EPGs, etc.). Contains NO ACI resources. | A structured JSON or YAML file (e.g., `aci_model.json`) saved to an artifact repository or S3 bucket. |
| Workspace 2: `aci-fabric-deploy` | Reads the `aci_model.json` file using `jsondecode()`. Contains NO Netbox data sources. It only provisions ACI resources based on the file. | A configured Cisco ACI fabric. |
This approach completely severs the live dependency between Netbox and ACI during the plan/apply phase. Your `aci-fabric-deploy` plan is now lightning-fast because it’s just reading a local data file. It doesn’t know or care about the other 500 tenants in Netbox. This also gives you a fantastic side benefit: a point-in-time artifact (aci_model.json) that represents the intended state, which is great for auditing and rollback.
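To make the hand-off concrete, here is a minimal sketch of both sides. It rests on a few assumptions: the two snippets live in separate directories with separate state, provider configuration is omitted just as in the snippets earlier, the `hashicorp/local` provider's `local_file` resource writes the file, and your pipeline copies `aci_model.json` from the first workspace into the second (or via your artifact store). The `tenants` key inside the JSON is an illustrative layout, not a fixed schema.

```hcl
# ---------------------------------------------------------------
# netbox-data-export/main.tf  (workspace 1: no ACI resources)
# ---------------------------------------------------------------
data "netbox_tenants" "aci_managed_tenants" {
  filter {
    name  = "tag"
    value = "aci-managed"
  }
}

# Write the point-in-time model. The JSON layout here is an assumption.
resource "local_file" "aci_model" {
  filename = "${path.module}/aci_model.json"
  content = jsonencode({
    tenants = [
      for t in data.netbox_tenants.aci_managed_tenants.tenants : {
        name = t.name
      }
    ]
  })
}

# ---------------------------------------------------------------
# aci-fabric-deploy/main.tf  (workspace 2: no Netbox data sources)
# ---------------------------------------------------------------
locals {
  # Plain file read: no API calls to Netbox during plan or apply.
  aci_model = jsondecode(file("${path.module}/aci_model.json"))
}

resource "aci_tenant" "generated_tenants" {
  for_each    = { for t in local.aci_model.tenants : t.name => t }
  name        = each.value.name
  description = "Managed by Terraform"
}
```

Because the deploy side only ever sees the file, you can diff two versions of `aci_model.json` in code review before a single `terraform plan` runs.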
A Word of Warning: This pattern introduces a new “state” to manage—the data file itself. Your CI/CD pipeline must be robust enough to ensure the `netbox-data-export` component runs before the `aci-fabric-deploy` component. Out-of-sync data files can be just as dangerous as a Terralith.
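Pipeline ordering is the real fix, but the deploy workspace can also refuse to run against an old file. The sketch below is one way to do that, under loud assumptions: the export step writes a `generated_at` field (an RFC 3339 timestamp) into `aci_model.json`, which is a hypothetical field you would have to add yourself, and you are on Terraform 1.5 or later for `plantimestamp()`. The variable and resource names are illustrative too.

```hcl
# aci-fabric-deploy/freshness.tf (assumes Terraform >= 1.5)

variable "max_model_age" {
  description = "Maximum accepted age of aci_model.json, as a duration string."
  type        = string
  default     = "24h"
}

locals {
  aci_model = jsondecode(file("${path.module}/aci_model.json"))

  # Oldest export time we are willing to deploy from.
  oldest_allowed = timeadd(plantimestamp(), "-${var.max_model_age}")
}

# terraform_data creates nothing on the fabric; it only carries the check.
resource "terraform_data" "model_freshness" {
  lifecycle {
    precondition {
      # timecmp returns -1 when the first timestamp is earlier than the second.
      # generated_at is a hypothetical field written by the export step.
      condition     = timecmp(local.aci_model.generated_at, local.oldest_allowed) >= 0
      error_message = "aci_model.json is older than the allowed age; re-run netbox-data-export first."
    }
  }
}
```

A failed precondition stops the plan with the error message, which is a much friendlier 2 AM experience than discovering the staleness after the apply.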
Solution 3: The ‘Nuclear’ Option – Roll Your Own Logic
Okay, let’s say you’re in a massive, brownfield environment with years of accumulated cruft. The existing providers are just too chatty, and the logic is too complex for HCL alone. It’s time to consider the nuclear option: drop to a lower-level language.
This involves writing a script (Python is my go-to for this) that uses the Netbox and ACI SDKs (like `pynetbox` and `acitoolkit`) to do the following:
- Fetch data from Netbox with hyper-specific, complex logic that you control completely.
- Perform any necessary data transformation or validation in Python.
- Generate a Terraform JSON configuration file (`.tf.json`) directly.
Your CI/CD pipeline then runs this Python script first, which creates the `.tf.json` file. Then, it runs `terraform plan` and `apply` as usual. Terraform simply executes the configuration file it was given, with no complex data sources to resolve.
This is a last resort. You are essentially writing your own mini-provider, and you take on the burden of maintaining it. But for a team with the right skills facing an intractable problem, it can turn an impossible situation into a manageable, deterministic process.
Stop Fighting the Tool, Start Guiding It
The Terralith isn’t a monster to be feared; it’s a sign that your automation has reached a new level of scale. By shifting your thinking from “get all the data” to “get only the data I need,” you can tame the beast. Start with filtering, evolve to decoupling, and keep the nuclear option in your back pocket. Your 2 AM self will thank you for it.
🤖 Frequently Asked Questions
❓ What is the ‘Terralith’ problem in Cisco ACI and Netbox automation?
The ‘Terralith’ refers to a monolithic Terraform plan that proposes an excessive number of changes (adds, modifies, destroys) for a small configuration update in Cisco ACI when using Netbox as a source of truth. It’s caused by Terraform’s data sources being ‘too greedy,’ pulling down all objects from Netbox and creating a massive dependency graph, leading to state drift and cascading changes.
❓ How do the proposed solutions compare in terms of effort and impact?
Surgical filtering is a quick, low-effort fix for immediate impact on plan size. The Data Component Pattern requires more architectural effort but provides a robust, decoupled solution with auditing benefits. The ‘Nuclear Option’ (custom logic) is the highest effort, reserved for intractable problems, offering complete control but increasing maintenance burden.
❓ What is a common implementation pitfall when using the Data Component Pattern?
A common pitfall is managing the new ‘state’ introduced by the data file (e.g., `aci_model.json`). If the `netbox-data-export` component doesn’t run before the `aci-fabric-deploy` component, or if the data file gets out of sync, it can lead to inconsistent or dangerous deployments. Robust CI/CD pipeline orchestration is crucial to ensure proper sequencing.