🚀 Executive Summary
TL;DR: Engineering teams often prioritize ‘DevOps marketing’ and shiny new tools over foundational reliability, leading to unstable services and accumulated tech debt. The solution involves refocusing on ‘good work’ through structured root cause analysis, mandating Service Level Objectives (SLOs) with error budgets, and empowering development teams with direct operational responsibility for their services.
🎯 Key Takeaways
- Utilize the ‘Five Whys’ intervention to diagnose the true root causes of system issues, moving beyond symptoms to identify underlying code or process flaws (e.g., N+1 problems).
- Implement a ‘Boring Backlog’ for critical reliability work and enforce Service Level Objectives (SLOs) with error budgets, halting new feature development when the budget is consumed to prioritize stability.
- Adopt a ‘You Build It, You Run It’ model, making development teams responsible for the operational health of their code to create immediate feedback loops and foster a proactive focus on observability, scalability, and failure modes.
Your fancy dashboards and CI/CD tools mean nothing if the core service is unreliable. We’ll explore why teams chase “DevOps marketing” over solid engineering and provide three battle-tested fixes to refocus on the “good work” that actually matters.
The Best DevOps is Just… Doing Good Work. Everything Else is Noise.
I remember this one project vividly. We had a junior engineer—let’s call him Alex—who was brilliant, but completely obsessed with implementing a service mesh. For weeks, every stand-up was about Istio, Linkerd, and sidecar proxies. He’d built impressive demos, had charts, the works. Meanwhile, PagerDuty was having a meltdown every night because prod-db-01, our main Postgres instance, was hitting 99% CPU and replication lag was climbing into the minutes. Alex was selling us the “marketing”—a cutting-edge, complex solution for a problem we didn’t have yet. The real, unglamorous “good work” was sitting right there in a ticket queue: optimizing a dozen slow queries and tuning `work_mem`. We were trying to buy a race car when the wheels on our family sedan were about to fall off.
The “Why”: Shiny Objects vs. Rusty Pipes
This situation isn’t unique, and it’s not Alex’s fault. It’s a cultural trap we fall into. The root cause is a misalignment of incentives. We call it “Resume-Driven Development” (RDD). It’s far more exciting to tell your next interviewer you implemented a globally distributed service mesh than it is to say you spent a month adding indexes to a legacy database. Management gets sold on the “marketing” of a new tool, while the critical, foundational “good work” of maintenance and reliability gets labeled as “tech debt” and endlessly deferred.
The noise is the allure of the new. The work is fixing the old. The noise is a new dashboard. The work is making sure the data feeding it is accurate. We get so focused on the tooling and the process that we forget what it’s all for: delivering a reliable service to the end-user.
The Fixes: How to Focus on the “Good Work”
So how do you pull the team out of this spiral? You don’t do it with another tool. You do it with focus, process, and sometimes, a little bit of forced empathy.
1. The Quick Fix: The “Five Whys” Intervention
This is my go-to move when a team is spinning its wheels. You grab the key players, get in a room (virtual or physical) with a whiteboard, and you don’t let anyone leave until you’ve dug past the symptoms. It’s not about blame; it’s about brutal honesty.
Let’s say the symptom is “The `auth-service` is slow.”
- 1. Why is it slow? Because database queries are timing out.
- 2. Why are they timing out? Because the primary DB, `prod-db-01`, has high CPU load.
- 3. Why does it have high load? Because a specific query is running thousands of times per minute.
- 4. Why is that query running so often? Because a new feature deployed last week is doing a user lookup inside a loop instead of batching it (a classic N+1 problem).
- 5. Why was that code shipped? Because the performance impact wasn’t caught in code review or testing.
Suddenly, the solution isn’t “we need a bigger database” or “we need a caching layer.” The solution is to fix the bad code and improve the review process. It’s immediate, targeted, and focuses on the real problem. It’s a quick, hacky way to force the team to do the “good work” right now.
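To make the N+1 pattern from step 4 concrete, here is a minimal sketch using Python's built-in `sqlite3` module against an in-memory database. The `users` table and the helper names (`make_db`, `names_n_plus_one`, `names_batched`) are hypothetical and only for illustration; the real `auth-service` code and schema are not shown in this post.

```python
import sqlite3


def make_db():
    """Create a throwaway in-memory database with a hypothetical users table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO users (id, name) VALUES (?, ?)",
        [(i, f"user-{i}") for i in range(1, 101)],
    )
    return conn


def names_n_plus_one(conn, user_ids):
    """Look up each user inside a loop: one query per id (the N+1 pattern)."""
    names = {}
    queries = 0
    for uid in user_ids:
        row = conn.execute(
            "SELECT name FROM users WHERE id = ?", (uid,)
        ).fetchone()
        queries += 1
        if row is not None:
            names[uid] = row[0]
    return names, queries


def names_batched(conn, user_ids):
    """Fetch all users in a single query using an IN clause."""
    ids = list(user_ids)
    if not ids:
        return {}, 0
    # Only "?" placeholders are interpolated into the SQL; the ids themselves
    # are passed as bound parameters.
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, name FROM users WHERE id IN ({placeholders})", ids
    ).fetchall()
    return {uid: name for uid, name in rows}, 1


if __name__ == "__main__":
    conn = make_db()
    slow, slow_queries = names_n_plus_one(conn, range(1, 51))
    fast, fast_queries = names_batched(conn, range(1, 51))
    print(slow == fast, slow_queries, fast_queries)  # True 50 1
```

Both functions return the same data, but the looped version issues one round trip per user while the batched version issues one in total. Against a loaded primary, that difference is roughly what separates "thousands of queries per minute" from a handful.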
2. The Permanent Fix: The Boring Backlog & The SLO Mandate
A one-time intervention is great, but you need to make this cultural. The best way I’ve found is to make reliability a non-negotiable feature using Service Level Objectives (SLOs) and Error Budgets.
First, you create a dedicated “Reliability” or “Toil” backlog. This is where the unglamorous-but-critical work lives. Then, you define a clear SLO for your service. For example:
# slo.yaml
service: user-login-api
objective: Availability
target: 99.9% over a 28-day window
# This means we can have (100 - 99.9)% = 0.1% downtime.
# 0.1% of 28 days (40,320 minutes) = ~40 minutes.
# Our "Error Budget" is ~40 minutes per 28-day window.
The rule is simple: if you burn through your error budget for the window, all new feature development halts. The entire team’s priority shifts to the “Boring Backlog” to fix whatever is causing the instability. This aligns everyone. Product managers, developers, and ops are all on the same page. Reliability isn’t a “nice to have”; it’s a prerequisite for building anything new.
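The arithmetic and the freeze rule can be sketched in a few lines of Python. This is an illustrative sketch, not a real SLO tool: the `Slo` class and the `budget_remaining` and `feature_work_allowed` helpers are hypothetical names, and it assumes downtime is tracked in minutes over the same 28-day window as the target.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    target_percent: float  # availability target, e.g. 99.9
    window_days: int       # rolling window length, e.g. 28

    def error_budget_minutes(self) -> float:
        """Minutes of downtime the target allows within the window."""
        window_minutes = self.window_days * 24 * 60
        return window_minutes * (100.0 - self.target_percent) / 100.0


def budget_remaining(slo: Slo, downtime_minutes: float) -> float:
    """Error budget left after the downtime observed so far in the window."""
    return slo.error_budget_minutes() - downtime_minutes


def feature_work_allowed(slo: Slo, downtime_minutes: float) -> bool:
    """Feature work continues only while some budget is left."""
    return budget_remaining(slo, downtime_minutes) > 0


if __name__ == "__main__":
    slo = Slo(target_percent=99.9, window_days=28)
    print(round(slo.error_budget_minutes(), 2))  # 40.32
    print(feature_work_allowed(slo, 25.0))       # True: budget remains
    print(feature_work_allowed(slo, 45.0))       # False: budget is spent
```

With 25 minutes of downtime the team still has about 15 minutes of budget and keeps shipping; at 45 minutes the budget is gone and the Boring Backlog takes over.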
Pro Tip: Don’t make your SLOs perfect from day one. Start with a reasonable target you know you can hit, and slowly tighten it over time. The goal is to build the process and the culture, not to create an impossible standard and demoralize the team.
3. The ‘Nuclear’ Option: You Build It, You Run It (For Real)
This one is controversial, but in my experience, it’s the most effective fix of all. If a development team is consistently shipping code that breaks in production, you hand them the pager. Make the team that writes the code responsible for its operational health.
This isn’t about punishment. It’s about creating the shortest possible feedback loop. Nothing will make a developer care more about logging, metrics, and query performance than being woken up at 3 AM because their code brought down `prod-cart-service-b4f7d`.
Here’s how this often plays out in practice:
| Before (“The Noise”) | After (“The Good Work”) |
| --- | --- |
| Dev team focuses on shipping features. Ops team owns the pager and the cleanup. There’s a wall of confusion between them. | The feature team is now on-call for their service. They immediately start adding better dashboards, more robust tests, and fixing performance bugs they previously ignored. |
| “It works on my machine” is a common refrain. Production issues are seen as someone else’s problem. | Developers start thinking about failure modes, scalability, and observability during development, not after an outage. |
Warning: This requires significant management buy-in and a blameless culture. You must provide the teams with the training, tools, and authority to actually fix their services. If you just throw a pager at them without support, you’re just setting them up to fail.
At the end of the day, that Reddit thread was right. Our best work—the stuff that truly matters—isn’t the flashy new tech we implement. It’s the quiet, consistent, and often boring effort of making our systems stable, reliable, and robust. That’s the real work. Everything else is just noise.
🤖 Frequently Asked Questions
❓ How can engineering teams effectively shift focus from ‘Resume-Driven Development’ to core reliability?
Teams can shift focus by employing the ‘Five Whys’ for root cause analysis, establishing SLOs with error budgets to prioritize reliability work, and implementing a ‘You Build It, You Run It’ model to foster direct operational ownership and accountability.
❓ How does prioritizing ‘good work’ compare to adopting cutting-edge DevOps tools and practices?
Prioritizing ‘good work’ means addressing foundational reliability (e.g., query optimization, bug fixes) before or alongside adopting new tools. While cutting-edge tools offer potential, they become ‘noise’ if core services are unstable, whereas ‘good work’ ensures a stable base for any advanced tooling, making it more effective in the long run.
❓ What is a common implementation pitfall for the ‘You Build It, You Run It’ model, and how can it be avoided?
A common pitfall is implementing ‘You Build It, You Run It’ without sufficient management buy-in, training, tools, or a blameless culture. To avoid this, teams must be provided with the necessary resources, authority, and support to actually fix their services, ensuring it’s about empowerment and learning, not punishment.