🚀 Executive Summary

TL;DR: The original Python-based Locust K8s Operator (v1) suffered from slow test startup times, high resource usage, and poor observability, hindering efficient performance testing. The new Go-based Locust K8s Operator (v2.0) resolves these issues with a complete rewrite, offering significantly faster test startup, reduced resource consumption, native OpenTelemetry support, and a seamless, zero-downtime migration path.

🎯 Key Takeaways

Locust K8s Operator v2.0 is a complete Go rewrite, addressing architectural bottlenecks of the Python-based v1, leading to faster reconciliation loops and improved efficiency in managing pod lifecycles.
The v2.0 operator dramatically reduces test startup times (from 5-10 minutes to under 1 minute for 50 workers) and significantly lowers the operator’s resource consumption on control plane nodes.
v2.0 introduces native OpenTelemetry support for enhanced observability and provides a zero-downtime migration path from v1: the new operator adopts and manages the existing `LocustTest` custom resources (the objects defined by the CRD).

Locust K8s Operator v2.0: Complete Go rewrite with faster startup, OpenTelemetry Support, and zero-downtime v1→v2 migration

Locust on K8s Was Slowing Us Down. Here’s How the v2 Operator Fixed It.

I still get a cold sweat thinking about it. It was 2 AM, the night before the big ‘Black Friday’ code freeze. We were trying to run a final soak test on our new `checkout-service`, and the Locust cluster on K8s was taking a solid 10 minutes just to spin up the worker pods. Ten minutes. Every time we tweaked a parameter, another ten minutes down the drain. The SRE lead was pinging me on Slack every 90 seconds. That night, I swore I’d find a better way. If you’ve felt that pain, you know exactly why the new Go-based Locust Operator is such a big deal.

So, What Was The Actual Problem?

Let’s be real, the original Locust K8s operator was a lifesaver. It got the job done. But it was written in Python, and while I love Python for scripting and services, it showed its limitations when managing hundreds of pod lifecycles at scale. The reconciliation loop could get bogged down, especially in busy clusters. This led to:

Slow Test Startup: The operator would take its sweet time creating and configuring all the worker pods.
Resource Intensive: It wasn’t the most lightweight controller, sometimes hogging CPU on the control plane nodes.
Observability Gaps: Getting deep performance metrics out of the operator itself and into our OpenTelemetry collector was a manual, often hacky, process.

The core issue wasn’t a bug; it was an architectural bottleneck. The rewrite in Go addresses this head-on. Go compiles to a native binary, has lightweight built-in concurrency, and is the language Kubernetes and most of its controller tooling are written in. The result is a controller that’s ridiculously fast and efficient.

The Fixes: From Band-Aid to Brain Surgery

Alright, you’re sold on the ‘why’, now for the ‘how’. Depending on your situation, you’ve got a few paths forward. Here’s how we’ve handled it at TechResolve across different teams.

1. The Quick Fix: Tuning the Old v1 Engine

Maybe you can’t migrate to v2 today. Your change control board is booked, or you just don’t have the time. I get it. The simplest thing you can do is throw more resources at the v1 operator. It’s a band-aid, not a cure, but it can get you through a tough spot.

Find your operator deployment and crank up the CPU/Memory requests and limits.


# Find your operator deployment
kubectl get deployment -n locust-system

# Edit the deployment (e.g., locust-operator-controller-manager)
kubectl edit deployment locust-operator-controller-manager -n locust-system

Then, inside the editor, find the `resources` section and give it a boost:


...
spec:
  template:
    spec:
      containers:
      - name: manager
        resources:
          limits:
            cpu: 1000m  # Was 500m
            memory: 512Mi # Was 128Mi
          requests:
            cpu: 500m   # Was 100m
            memory: 256Mi # Was 64Mi
...

This doesn’t fix the underlying inefficiency, but it can stop the operator from getting throttled on a busy cluster, which might just cut your startup time in half.
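
If you’d rather script the change than hand-edit the Deployment, something like the sketch below should do the same job. It assumes the deployment name, container name (`manager`), and namespace from the example above; `bump_v1_operator_resources` is just an illustrative helper name, not part of any tool.

```shell
# Sketch: apply the resource bump from the YAML above without opening an editor.
# Names are taken from the example in this post; adjust them to your cluster.
bump_v1_operator_resources() {
  local namespace="${1:-locust-system}"
  local deployment="${2:-locust-operator-controller-manager}"

  # `kubectl set resources` patches the pod template, which triggers a rollout.
  kubectl set resources deployment "${deployment}" \
    --namespace "${namespace}" \
    --containers=manager \
    --limits=cpu=1000m,memory=512Mi \
    --requests=cpu=500m,memory=256Mi || return 1

  # Block until the restarted operator pod is available (or the timeout hits).
  kubectl rollout status "deployment/${deployment}" \
    --namespace "${namespace}" --timeout=120s
}
```

Call it with no arguments to use the defaults, or pass a namespace and deployment name if yours differ.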

2. The Permanent Fix: The Zero-Downtime Migration to v2

This is the path you want to be on. The Locust team did a fantastic job making this upgrade smooth. The key is that the v2 operator can adopt and manage the existing `LocustTest` custom resources that were created under v1. You’re essentially doing a hot-swap of the operator’s brain.

Step 1: First, scale down the old v1 operator. Don’t delete it, just put it to sleep. This prevents two controllers from fighting over the same resources.


kubectl scale deployment locust-operator-controller-manager --replicas=0 -n locust-system

Step 2: Apply the new v2 operator manifest. This installs the new Go-based controller, which picks up the existing `LocustTest` resources and takes over management of any running tests.


# Apply the v2.0.0 release manifest from the official repo
kubectl apply -f https://github.com/Locust/locust-operator/releases/download/v2.0.0/locust-operator.yaml

Step 3: Once the new operator pod is running, you can safely delete the old v1 deployment.


kubectl delete deployment locust-operator-controller-manager -n locust-system

That’s it. Your existing tests will continue running, but any new tests or changes will be handled by the new, lightning-fast operator. The first time you spin up a 50-worker test in under a minute, you’ll see what I mean.
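
Steps 1 to 3 can be strung together with a health check in between, so the v1 deployment is only deleted once v2 reports ready. This is a rough sketch, not an official migration script: `migrate_locust_operator` is a made-up helper, and the v2 deployment name is something you have to supply yourself after checking the manifest. It also skips the delete if v2 turns out to reuse the v1 deployment name and namespace, since Step 3 would then remove the new operator.

```shell
# Sketch: Steps 1-3 with a readiness gate before v1 is deleted.
# The v2 deployment name is NOT known here; look it up in the v2 manifest.
migrate_locust_operator() {
  local v2_deployment="${1:-}"               # assumption: supplied by you
  local v2_namespace="${2:-locust-system}"   # assumption: same namespace as v1
  local v1_deployment="locust-operator-controller-manager"
  local v1_namespace="locust-system"
  local manifest="https://github.com/Locust/locust-operator/releases/download/v2.0.0/locust-operator.yaml"

  if [ -z "${v2_deployment}" ]; then
    echo "usage: migrate_locust_operator <v2-deployment> [v2-namespace]" >&2
    return 2
  fi

  # Step 1: put the v1 controller to sleep so two controllers never compete.
  kubectl scale deployment "${v1_deployment}" --replicas=0 -n "${v1_namespace}" || return 1

  # Step 2: install the v2 operator.
  kubectl apply -f "${manifest}" || return 1

  # Gate: wait for the v2 deployment to finish rolling out.
  if ! kubectl rollout status "deployment/${v2_deployment}" -n "${v2_namespace}" --timeout=180s; then
    echo "v2 operator not ready; v1 left in place at 0 replicas" >&2
    return 1
  fi

  # Step 3: delete v1, unless v2 reuses the same name and namespace.
  if [ "${v1_deployment}" = "${v2_deployment}" ] && [ "${v1_namespace}" = "${v2_namespace}" ]; then
    echo "v2 reuses the v1 deployment name; skipping delete" >&2
    return 0
  fi
  kubectl delete deployment "${v1_deployment}" -n "${v1_namespace}"
}
```

If the rollout check fails, the function stops with v1 still present at zero replicas, so rolling back is a matter of scaling it back up with `kubectl scale`.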

Heads Up: The CRD API version has changed. While v2 understands the old `v1alpha1`, you should plan to update your `LocustTest` YAML files to the new `locust.io/v1beta1` spec to take advantage of new features like OpenTelemetry integration. It’s not urgent, but it’s technical debt you should pay down.
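
To size up that debt, you can scan your manifests for files still on the old version. The sketch below assumes the old files declare `apiVersion: locust.io/v1alpha1` (inferred from the CRD group and the version names above, so check one of your own files first); `find_legacy_locusttests` is an illustrative helper name.

```shell
# Sketch: list YAML files that still declare the old LocustTest API version.
# Assumes the v1 manifests use `apiVersion: locust.io/v1alpha1`.
find_legacy_locusttests() {
  local dir="${1:-.}"

  # -r: recurse, -l: print file names only, -E: extended regex.
  grep -rlE --include='*.yaml' --include='*.yml' \
    '^apiVersion:[[:space:]]*locust\.io/v1alpha1[[:space:]]*$' "${dir}"
}
```

Run it from the root of the repo that holds your test definitions, for example `find_legacy_locusttests ./perf-tests`. It prints one path per file that still needs updating and exits non-zero when nothing matches.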

3. The ‘Nuclear’ Option: A Fresh v2 Install

If you’re setting up a new performance testing environment or your existing one is non-critical, just rip it all out. Sometimes a clean slate is the best way to avoid weird transitional states.

This is for the `perf-testing-cluster` that you can afford to wipe, not `prod-monitoring-cluster-01`.


# Uninstall the old operator (use whatever method you installed it with)
helm uninstall locust -n locust-system

# Check for leftover CRDs; Helm does not remove CRDs shipped in a chart's crds/ directory
kubectl get crd | grep locust

# If any remain, delete them manually (this also deletes every LocustTest resource)
kubectl delete crd locusttests.locust.io

# Now, install v2 from scratch
kubectl apply -f https://github.com/Locust/locust-operator/releases/download/v2.0.0/locust-operator.yaml

This is the cleanest approach, but it means downtime for your testing framework. We used it on our dev cluster and followed the migration path (Fix #2) for our shared staging environment.
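
Two small checks can make the wipe less nerve-racking: list what you are about to lose before deleting the CRD, and confirm the CRD is back once v2 is installed. The sketch below assumes the CRD name `locusttests.locust.io` from the commands above; both function names are illustrative only.

```shell
# Before the wipe: show every LocustTest that deleting the CRD would remove.
list_locusttests() {
  kubectl get locusttests.locust.io --all-namespaces
}

# After the install: wait for the API server to accept the CRD,
# then print the API versions it serves.
verify_locust_crd() {
  local crd="${1:-locusttests.locust.io}"

  kubectl wait --for=condition=Established "crd/${crd}" --timeout=60s || return 1
  kubectl get crd "${crd}" -o jsonpath='{.spec.versions[*].name}'
}
```

If `verify_locust_crd` prints the version you expect, the new CRD is in place and you can start creating tests against it.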

The Payoff: V1 vs. V2

The difference is night and day. Here’s a quick breakdown of what you’re gaining:

Feature	Operator v1 (Python)	Operator v2 (Go)
Startup Time (50 workers)	5-10 minutes	< 1 minute
Resource Usage	Moderate-High CPU	Very Low
Observability	Manual (scrape Prometheus)	Native OpenTelemetry Support

Moving to v2 wasn’t just an upgrade; it was a quality-of-life improvement for my entire team. We can iterate on performance tests faster, which means we catch issues sooner. And that means fewer 2 AM fire drills. Trust me, your sleep schedule will thank you.

Darian Vance

Lead Cloud Architect & DevOps Strategist

With over 12 years in system architecture and automation, Darian specializes in simplifying complex cloud infrastructures. An advocate for open-source solutions, he founded TechResolve to provide engineers with actionable, battle-tested troubleshooting guides and robust software alternatives.

🤖 Frequently Asked Questions

❓ What are the primary advantages of upgrading to Locust K8s Operator v2.0?

Locust K8s Operator v2.0, a Go rewrite, offers dramatically faster test startup times (under 1 minute for 50 workers), significantly lower resource consumption on the control plane, and native OpenTelemetry support for improved observability.

❓ How does the Go-based Locust Operator v2.0 differ from its Python v1 predecessor?

V2.0, written in Go, provides superior performance with startup times under 1 minute compared to v1’s 5-10 minutes. It also has very low resource usage versus v1’s moderate-high CPU, and offers native OpenTelemetry support, unlike v1’s manual Prometheus scraping.

❓ What is the recommended procedure for a zero-downtime migration from Locust Operator v1 to v2?

First, scale down the v1 operator deployment to zero replicas. Then, apply the v2 operator manifest, which will adopt the existing `LocustTest` resources. Once v2 is running, safely delete the old v1 deployment. Users should also plan to update `LocustTest` YAMLs to the new `locust.io/v1beta1` spec for new features.
