🚀 Executive Summary

TL;DR: The article addresses the common “No space left on device” Docker errors on CI/CD runners, primarily caused by Docker’s aggressive caching of intermediate layers, dangling images, and anonymous volumes. It provides three solutions ranging from immediate manual intervention to automated cleanup and advanced ephemeral infrastructure to prevent disk bloat.

🎯 Key Takeaways

  • Docker’s default caching behavior, storing intermediate layers and unused resources, is the primary cause of “No space left on device” errors on CI/CD runners.
  • The `docker system prune -a --volumes -f` command offers an immediate, emergency fix by removing all stopped containers, unused networks, dangling/unused images, and unused local volumes.
  • Automated garbage collection can be implemented via a cron job using `docker system prune -a --filter "until=24h" -f` and `docker builder prune --filter "until=24h" -f` to balance cache preservation with disk space management.
  • Ephemeral runners, which provision a new VM for each build and destroy it afterward, provide the ultimate solution for a sterile build environment, preventing disk bloat entirely, though at the cost of local cache and increased infrastructure complexity.

Weekly: Questions and advice

SEO Summary: A practical guide to diagnosing and resolving the infamous “No space left on device” Docker errors on CI/CD runners. Learn the root causes of container bloat and discover three proven strategies ranging from quick emergency clears to fully automated, ephemeral infrastructure.

The Silent Killer of CI/CD Pipelines: Slaying the “No Space Left on Device” Dragon

I still remember dragging myself out of bed at 2:00 AM on a Saturday because the on-call pager was screaming. The release for our biggest client was blocked. I SSH’d into ci-builder-prod-04, fully expecting a complex networking failure or a mysterious IAM permission drop. Instead, I was greeted by the most infuriating, amateur-hour error in the DevOps playbook: No space left on device. A junior dev had been iterating heavily on a massive machine-learning image, and Docker had silently hoarded every single byte of disk space until the VM choked to death. Listen, if you are hitting this wall, take a deep breath. You aren’t the first, and you certainly won’t be the last. Let’s fix it.

The “Why”: What is Actually Eating Your Disk?

I saw this question pop up in the weekly advice thread on Reddit, and the frustration was palpable. The developer was asking why their pipeline randomly fails on Friday afternoons. The culprit is almost always Docker’s caching mechanism.

By design, Docker is a packrat. Every time you build an image, it stores intermediate layers in the build cache. When a build fails, or when a container stops, those resources aren’t immediately deleted. You end up with “dangling” images (images with no tags), abandoned anonymous volumes, and a bloated build cache. Over a few days of heavy commits, your 100GB disk on ci-builder-prod-04 fills up completely. It’s not a bug; it’s a feature designed to make subsequent builds faster, but without garbage collection, it’s a ticking time bomb.
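Before deleting anything, it is worth seeing exactly where the space went. Docker ships a read-only diagnostic for this, so you can confirm it really is the build cache (and not, say, runaway logs) before you reach for the prune commands below:

```shell
# Read-only: breaks disk usage down by images, containers, local
# volumes, and build cache, including how much is "reclaimable".
docker system df

# -v adds a per-image, per-container, and per-volume breakdown.
docker system df -v
```

On a bloated runner, the "Build Cache" and "RECLAIMABLE" columns are usually where the missing gigabytes are hiding.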

The Fixes: From Duct Tape to Architecture

When you are in the trenches, the solution you choose depends entirely on how much time you have before the deployment window closes. Here are the three ways I handle this.

1. The Quick Fix (The Emergency Release Valve)

If your pipeline is red, the release manager is breathing down your neck, and you just need the server to work right now, you need the emergency flush. I will admit, doing this manually is a bit hacky, but it works instantly.

docker system prune -a --volumes -f

Running this command will ruthlessly remove all stopped containers, delete all networks not used by at least one container, wipe out all dangling and unused images, and nuke all unused local volumes. It will buy you the space you need to get the deployment out the door.

Pro Tip: Be incredibly careful running this on a production host like prod-db-01. If you have a stopped container that you intend to spin back up, this command will vaporize it. Only use this safely on dedicated, stateless build runners!
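If the `-a --volumes` combination feels too risky for the host you are on, there is a gentler middle ground. Without `-a`, prune only removes dangling (untagged) images, and without `--volumes` it leaves every volume untouched, so stopped-but-wanted containers' data survives:

```shell
# Gentler variant: only stopped containers, unused networks,
# dangling images, and build cache are removed. Tagged images
# and all volumes are left alone.
docker system prune -f

# Reclaim the build cache separately if that is the real hog.
docker builder prune -f
```

You reclaim less space this way, but it is a much safer default on any box you are not 100% sure about.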

2. The Permanent Fix (Automated Garbage Collection)

We do not want to wake up at 2:00 AM again. If you are managing persistent CI runners, you need to automate the cleanup. The best way to do this is to add a recurring cron job on the runner, or better yet, append a cleanup step to the end of your pipeline configuration.
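For the pipeline-side variant, here is a hedged sketch of what that trailing cleanup step might look like in a GitHub Actions workflow on a self-hosted runner (the job layout and image tag are hypothetical, not from the original incident):

```yaml
# Hypothetical trailing cleanup step on a persistent self-hosted runner.
# `if: always()` makes the step run even when the build itself fails,
# which is exactly when orphaned layers tend to pile up.
jobs:
  build:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t myapp:${{ github.sha }} .
      - name: Reclaim Docker disk space
        if: always()
        run: docker system prune -a --filter "until=24h" -f
```

The same idea ports to GitLab's `after_script` or Jenkins' `post { always { ... } }` blocks.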

Instead of the scorched-earth approach above, we can set up a targeted filter that only deletes items older than 24 hours. This preserves your recent build cache (keeping builds fast) while preventing infinite disk growth.

#!/bin/bash
# Save as /etc/cron.daily/docker-cleanup and make it executable (chmod +x)
docker system prune -a --filter "until=24h" -f
docker builder prune --filter "until=24h" -f

This is a solid, mature approach for mid-sized teams. You are acknowledging the technical debt and putting an automated sweeper in place to handle it.
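One refinement I like on top of the daily cron job, sketched here as an assumption rather than part of the original script: gate the prune on actual disk pressure, so runners with plenty of headroom keep their warm cache. The 85% threshold is arbitrary, and `df --output=pcent` is GNU coreutils syntax, so adjust both for your environment:

```shell
#!/bin/bash
# Sketch: prune only when the root filesystem crosses a usage threshold.
# THRESHOLD is a tunable assumption, not a Docker default.
THRESHOLD=85
# Extract the bare percentage, e.g. "42" (GNU df syntax assumed).
USAGE=$(df --output=pcent / | tail -n1 | tr -dc '0-9')

if [ "$USAGE" -ge "$THRESHOLD" ]; then
  echo "Disk at ${USAGE}%: pruning Docker resources older than 24h"
  docker system prune -a --filter "until=24h" -f
  docker builder prune --filter "until=24h" -f
else
  echo "Disk at ${USAGE}%: below ${THRESHOLD}%, keeping the cache warm"
fi
```

This way a quiet runner never loses its cache, and a busy one cleans itself up before the pager goes off.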

3. The ‘Nuclear’ Option (Ephemeral Runners)

If you want to operate like an elite engineering team, you stop treating your CI runners like pets and start treating them like cattle. The ultimate fix to a corrupted or full build environment is to guarantee a fresh, sterile environment for every single job.

We moved TechResolve’s architecture over to ephemeral runners last year. When a webhook triggers a pipeline, our orchestrator provisions a brand new VM. The build runs. When the build finishes, the VM is instantly destroyed. It is physically impossible to run out of disk space from previous builds because the previous builds no longer exist.
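As one concrete illustration of the pattern (not TechResolve's exact setup), GitHub's self-hosted runner agent supports a run-one-job-then-deregister mode via the `--ephemeral` flag on `config.sh`; your orchestrator handles the VM lifecycle around it. The URL and token handling below are placeholders:

```shell
# Hypothetical bootstrap script baked into a throwaway VM image.
# RUNNER_TOKEN acquisition is your orchestrator's job (placeholder here).
./config.sh --url https://github.com/your-org/your-repo \
            --token "$RUNNER_TOKEN" \
            --ephemeral   # deregister automatically after exactly one job
./run.sh                  # runs a single job, then exits

# The orchestrator destroys the VM once run.sh returns, e.g. via a
# cloud API call or `terraform destroy`. Nothing survives to bloat.
```

GitLab's runner and most cloud CI orchestrators offer an equivalent single-use mode; the flag names differ, but the provision-build-destroy loop is the same.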

| Approach | Pros | Cons |
| --- | --- | --- |
| Manual prune | Instant relief, zero setup required. | Requires human intervention, ruins cache. |
| Cron job | Automated, balances space and cache speed. | Disk can still fill up if a single build is massive. |
| Ephemeral runners | 100% sterile environment, impossible to bloat. | No local cache (slower builds), complex infrastructure to manage. |

My advice to you? Start with the Cron Job fix today to stop the bleeding. Then, put a ticket in the backlog to investigate ephemeral runners for next quarter. DevOps is an iterative process—fix the immediate pain, then engineer away the possibility of it happening again.


Darian Vance

Lead Cloud Architect & DevOps Strategist

With over 12 years in system architecture and automation, Darian specializes in simplifying complex cloud infrastructures. An advocate for open-source solutions, he founded TechResolve to provide engineers with actionable, battle-tested troubleshooting guides and robust software alternatives.


🤖 Frequently Asked Questions

❓ How do I resolve ‘No space left on device’ errors caused by Docker on CI/CD runners?

You can resolve this by using `docker system prune -a --volumes -f` for an immediate fix, implementing an automated cron job with `docker system prune -a --filter "until=24h" -f` for persistent runners, or adopting ephemeral runners for a completely sterile build environment.

❓ How do manual pruning, automated cron jobs, and ephemeral runners compare for managing Docker disk space on CI/CD?

Manual pruning provides instant relief but destroys the build cache. Automated cron jobs offer a balance, preserving recent cache while cleaning older items, but a single massive build can still fill the disk. Ephemeral runners guarantee a 100% sterile environment, making disk bloat impossible, but they lack local cache and require complex infrastructure management.

❓ What is a common implementation pitfall when using `docker system prune`?

A common pitfall is running `docker system prune -a --volumes -f` on a production host (e.g., `prod-db-01`), as it will vaporize any stopped containers intended for future use. This command should only be run on dedicated, stateless build runners.
