🚀 Executive Summary

TL;DR: A mysterious 4s+ delay on WordPress admin-ajax.php calls, initially suspected to be plugin-related, was ultimately identified as CPU Steal in a shared cloud environment. The issue was diagnosed by observing the high ‘%st’ metric in the `top` command and resolved by upgrading to dedicated CPU instance types or migrating to more robust infrastructure.

🎯 Key Takeaways

  • CPU Steal is a virtualization-specific metric where a VM’s CPU is ready to work but is forced to wait for the physical CPU to become available, common in shared cloud environments.
  • High CPU Steal (consistently over 5-10%) can be identified by checking the ‘%st’ value in the `top` command output on your server.
  • Shared-core or ‘burstable’ cloud instance types (e.g., AWS t-series, GCP e2-series) are particularly susceptible to CPU Steal due to resource over-subscription.
  • Effective solutions include upgrading to dedicated CPU instance families (e.g., AWS m5, GCP n2) or, for persistent issues, migrating to dedicated hosts or managed WordPress hosts.

Debugged a 4s+ delay on admin-ajax.php. Turns out it wasn't the plugins, it was CPU Steal.

That mysterious multi-second lag on your WordPress admin-ajax.php calls probably isn’t your new plugin. We’ll show you how to identify and fix CPU Steal, the silent performance killer in shared cloud environments.

That 4-Second Lag Isn’t Your Plugin: A DevOps War Story on CPU Steal

I still remember the launch-day panic. We had a major e-commerce client, wp-prod-web-01 was buckling, and every single “Add to Cart” click was taking 4-5 seconds. The dev team was in a war room, frantically disabling plugins one by one, convinced the new shipping calculator was the culprit. New Relic was screaming about admin-ajax.php, but the PHP traces were clean—no single function was taking all the time. It felt like the server was just… thinking. After an hour of chasing ghosts in the application code, I finally SSH’d in, ran top, and saw it. A number I’d learned to dread: %st. The CPU wasn’t overloaded; it was being stolen.

So, What is This “CPU Steal” Anyway?

Before we go further, let’s clear this up. CPU Steal is a metric specific to virtualized environments, like the cloud server you’re probably running on AWS, GCP, or DigitalOcean. Think of it like this: your Virtual Machine (VM) is an apartment in a large building (the physical server). You’re promised a certain amount of electricity (CPU time), but you have noisy neighbors. When another VM on the same physical hardware starts a massive, resource-intensive job, the hypervisor (the landlord) has to share the same physical cores among everyone, and your VM’s virtual CPU ends up queued behind the neighbor’s. That “stolen” time, during which your VM had work ready to run but had to wait for a physical CPU, is CPU Steal.

For WordPress, this is poison. A single request to admin-ajax.php can involve a dozen database queries and PHP functions. When each tiny operation is forced to wait an extra few milliseconds for the CPU to become available, those delays stack up into a multi-second nightmare. It’s not your code; it’s the infrastructure failing you silently.
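
To get a feel for how those small waits add up, here is a back-of-the-envelope sketch. It assumes a simplified model that is mine, not a measurement: whenever the VM wants the CPU it is either running (busy) or waiting on the hypervisor (steal), so CPU-bound work stretches by roughly (busy + steal) / busy. The function name and the numbers are hypothetical.

```python
def stretch_factor(busy_pct: float, steal_pct: float) -> float:
    """Estimate how much longer CPU-bound work takes under steal.

    Simplified model (an assumption): time the guest wants the CPU is
    split between running (busy) and waiting on the hypervisor (steal),
    so wall time grows by (busy + steal) / busy on average.
    """
    if busy_pct <= 0:
        raise ValueError("busy_pct must be positive")
    if steal_pct < 0:
        raise ValueError("steal_pct must not be negative")
    return (busy_pct + steal_pct) / busy_pct


if __name__ == "__main__":
    # Hypothetical figures: the VM is busy 15% of the time, steal is 40%.
    factor = stretch_factor(busy_pct=15.0, steal_pct=40.0)
    cpu_seconds = 1.2  # assumed CPU time the request needs on a quiet host
    print(f"stretch: {factor:.2f}x, wall time: {cpu_seconds * factor:.1f}s")
```

Under those assumed numbers, a request that needs a little over a second of CPU lands in the multi-second range. It is an average, so individual requests can fare better or worse.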

You can spot it by running the top command on your server:

top - 13:37:01 up 15 days,  4:20,  1 user,  load average: 0.55, 0.65, 0.75
Tasks: 123 total,   1 running, 122 sleeping,   0 stopped,   0 zombie
%Cpu(s): 12.5 us,  3.1 sy,  0.0 ni, 43.3 id,  0.5 wa,  0.0 hi,  0.2 si, 40.4 st

See that last value? 40.4 st. That means that for roughly 40% of the sampled interval, our VM had work ready to run while the hypervisor was busy running someone else. This is an emergency. Anything consistently over 5-10% is a major red flag that needs immediate attention.
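
If you’d rather script the check than eyeball `top`, the same number can be derived from the kernel counters in `/proc/stat` on Linux. The sketch below is one way to do it, not how `top` itself is implemented; the function names are mine. It takes two snapshots of the aggregate `cpu` line and reports what share of the elapsed CPU time was steal.

```python
# Field order of the aggregate "cpu" line in /proc/stat (Linux).
# The trailing guest/guest_nice columns are skipped on purpose: the kernel
# already counts them inside user/nice.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line: str) -> dict[str, int]:
    """Parse the aggregate 'cpu' line from /proc/stat into named counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    if len(values) < len(FIELDS):
        raise ValueError("no steal column found in this cpu line")
    return dict(zip(FIELDS, values))


def steal_percent(before: str, after: str) -> float:
    """Steal as a percentage of all CPU time elapsed between two snapshots."""
    first, second = parse_cpu_line(before), parse_cpu_line(after)
    deltas = {name: second[name] - first[name] for name in FIELDS}
    total = sum(deltas.values())
    return 100.0 * deltas["steal"] / total if total > 0 else 0.0


def read_cpu_line(path: str = "/proc/stat") -> str:
    """Return the first line of /proc/stat, which is the aggregate cpu line."""
    with open(path) as handle:
        return handle.readline()
```

On a Linux guest, call `read_cpu_line()` twice a few seconds apart and pass both results to `steal_percent`. A single snapshot is not useful on its own because the counters are cumulative since boot.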

Darian’s Pro Tip: Don’t wait for users to complain. Set up monitoring in Datadog, New Relic, or even a simple Zabbix agent to alert you when CPU Steal stays above a 5% threshold for more than five minutes. The metric is usually exposed as `cpu.steal` or something close to it; the exact name varies by tool. This is a critical health metric for any VM in a shared environment.
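
The “above 5% for more than five minutes” rule is easy to get subtly wrong, so here is a minimal sketch of the logic, assuming you already collect (timestamp, steal %) samples somewhere. It is not the alerting syntax of any of the tools above; the function name and defaults are illustrative.

```python
from typing import Iterable, Tuple


def steal_alert(
    samples: Iterable[Tuple[float, float]],
    threshold_pct: float = 5.0,
    window_s: float = 300.0,
) -> bool:
    """Return True when steal has stayed above the threshold for a full window.

    samples: (unix_timestamp, steal_pct) pairs in ascending time order.
    A single sample at or below the threshold resets the clock.
    """
    breach_start = None  # timestamp where the current run of breaches began
    latest = None
    for timestamp, steal_pct in samples:
        latest = timestamp
        if steal_pct > threshold_pct:
            if breach_start is None:
                breach_start = timestamp
        else:
            breach_start = None
    if breach_start is None or latest is None:
        return False
    return latest - breach_start >= window_s
```

Requiring an unbroken run keeps one noisy spike from paging anyone, while sustained steal still trips the alert.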

Okay, I’m Being Robbed. How Do I Stop It?

Once you’ve confirmed CPU Steal is the villain, blaming your plugins is off the table. Now we fix the environment. Here are your options, from the quick-and-dirty to the permanent solution.

Fix #1: The Cloud Shuffle (The Quick & Dirty)

This is the classic “turn it off and on again” for the cloud. When you stop and start your instance (not a reboot, a full stop then start), the cloud provider’s control plane will often re-provision it on a different physical host. If you’re lucky, you’ll land in a quieter “neighborhood” with less-abusive neighbors.

  • Pros: Fast, free, and can provide immediate relief.
  • Cons: It’s a gamble. You could land on an even noisier host. The problem will almost certainly return as new VMs are provisioned around you. It’s a temporary patch, not a fix.

I’ve used this trick during an outage to buy the team a few hours of stability while we planned a real fix. Don’t rely on it.
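
For AWS specifically, the shuffle can be scripted. The sketch below assumes the `boto3` library and an EC2 instance; the helper name and instance ID are placeholders, and other providers will need their own equivalent. The client is passed in so the function is easy to dry-run with a stub.

```python
def stop_then_start(ec2, instance_id: str) -> None:
    """Fully stop, then start, an EC2 instance (deliberately not a reboot).

    `ec2` is assumed to be a boto3 EC2 client, e.g. boto3.client("ec2").
    """
    ids = [instance_id]
    ec2.stop_instances(InstanceIds=ids)
    # Block until the instance is really stopped before starting it again.
    ec2.get_waiter("instance_stopped").wait(InstanceIds=ids)
    ec2.start_instances(InstanceIds=ids)
    ec2.get_waiter("instance_running").wait(InstanceIds=ids)


# Example (needs AWS credentials; the instance ID is a placeholder):
#   import boto3
#   stop_then_start(boto3.client("ec2"), "i-0123456789abcdef0")
```

Keep in mind that a stop/start wipes instance-store volumes and changes the public IP unless you use an Elastic IP, so check both before trying this on production.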

Fix #2: Upgrade Your Ride (The Permanent Fix)

The root cause is almost always a “burstable” or shared-core instance type. These instances (like AWS t-series, GCP e2-series) are cheap because you’re explicitly agreeing to share CPU resources. On burstable types, the provider may also throttle your own instance once its burst allowance runs out, and that throttling can show up as steal too. When you need consistent performance, you have to pay for it by moving to an instance family with dedicated CPU cores.

This means moving away from instance types designed for low-baseline, spiky workloads and onto general-purpose or compute-optimized families.

| Cloud Provider | Problematic Instance Family (Shared) | Better Instance Family (Dedicated) |
| --- | --- | --- |
| AWS | t2, t3, t4g | m5, m6i, c5, c6i |
| GCP | e2, f1-micro, g1-small | n2, c3, m3 |
| DigitalOcean | Basic Droplets (Shared CPU) | General Purpose / CPU-Optimized Droplets |

Yes, this will cost more. But the cost of downtime, lost sales, and developer hours spent chasing ghosts is almost always higher. This is the correct, long-term solution for any production workload.
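
If you run more than a handful of servers, a quick audit helps you find the ones worth upgrading first. This is a rough sketch that only encodes the AWS and GCP families named in the table above; the prefix lists and function name are mine, and DigitalOcean is left out because its plans are not identified by a family prefix here.

```python
# Shared-CPU families taken from the table above. Prefix matching is a
# simplification; extend the lists for your own fleet.
SHARED_PREFIXES = {
    "aws": ("t2.", "t3.", "t4g."),
    "gcp": ("e2-", "f1-micro", "g1-small"),
}


def is_shared_cpu(provider: str, instance_type: str) -> bool:
    """Return True if the instance type is in a shared-CPU family listed above."""
    prefixes = SHARED_PREFIXES.get(provider.lower())
    if prefixes is None:
        raise ValueError(f"unknown provider: {provider}")
    return instance_type.lower().startswith(prefixes)


if __name__ == "__main__":
    # Hypothetical inventory of (provider, instance type) pairs.
    inventory = [("aws", "t3.medium"), ("aws", "m5.large"), ("gcp", "e2-medium")]
    for provider, instance_type in inventory:
        if is_shared_cpu(provider, instance_type):
            print(f"{provider} {instance_type}: shared CPU, upgrade candidate")
```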

Fix #3: The Eviction Notice (The ‘Nuclear’ Option)

Sometimes, the problem isn’t just one noisy neighbor; it’s the entire building. If you’re on a budget cloud provider known for over-subscribing their hardware, or if even your upgraded instances are seeing steal, it might be time to move.

This can mean a few things:

  • Migrating to a different cloud: If you’re constantly fighting this on one provider, another might have better resource isolation.
  • Moving to a Dedicated Host: Most major clouds (like AWS and GCP) offer the option to rent an entire physical server for your exclusive use. This completely eliminates the noisy neighbor problem but comes at a significant cost premium.
  • Switching to a Managed WordPress Host: Platforms like Kinsta or WP Engine are built for this. They manage the underlying infrastructure and are architected to prevent this exact problem. You pay more, but you’re outsourcing the headache.

This is obviously the most disruptive option, but for mission-critical sites, if you can’t get stability after trying Fix #2, it has to be on the table. Your application’s performance is only as good as the foundation it’s built on. Don’t let a faulty foundation make you tear your hair out over perfectly good code.

Darian Vance - Lead Cloud Architect

Darian Vance

Lead Cloud Architect & DevOps Strategist

With over 12 years in system architecture and automation, Darian specializes in simplifying complex cloud infrastructures. An advocate for open-source solutions, he founded TechResolve to provide engineers with actionable, battle-tested troubleshooting guides and robust software alternatives.


🤖 Frequently Asked Questions

❓ What is CPU Steal and how does it impact WordPress performance?

CPU Steal occurs when a virtual machine’s CPU is ready to execute tasks but is forced to wait because the physical CPU is busy with other VMs. For WordPress, this causes significant delays in operations like `admin-ajax.php` calls, as each small task waits for CPU availability, stacking up into multi-second lags.

❓ How do the different solutions for CPU Steal compare?

The ‘Cloud Shuffle’ (stopping and starting an instance) is a quick, temporary gamble. Upgrading to dedicated CPU instance types (e.g., AWS m5, GCP n2) is the recommended, permanent solution for consistent performance, albeit at a higher cost. The ‘Eviction Notice’ (migrating to another cloud, a dedicated host, or a managed WordPress host) is the most disruptive option, reserved for severe, persistent problems.

❓ What is a common pitfall when troubleshooting performance issues related to CPU Steal?

A common pitfall is misdiagnosing the problem as application-level issues like faulty plugins or inefficient code. Without checking for CPU Steal via `top`’s ‘%st’ metric, developers can waste significant time optimizing code or disabling plugins when the root cause is infrastructure-related.
