🚀 Executive Summary
TL;DR: Systemd’s parallel startup can cause race conditions where services fail because dependencies like databases aren’t fully initialized despite `After=` and `Requires=` directives. The most robust solution involves implementing connection retry logic with exponential backoff directly within the application code, making services inherently resilient and portable.
🎯 Key Takeaways
- Systemd’s `After=` and `Requires=` directives only ensure a dependency’s command has started, not that it’s fully initialized and ready to accept connections, leading to common startup failures.
- Implementing a polling script (e.g., `wait-for-postgres.sh` using `psql -c '\q'` or `pg_isready`) within the `ExecStart` directive of a systemd unit file provides a reliable infrastructure-level solution to ensure dependency readiness.
- The most resilient and portable solution is to build connection retry logic with exponential backoff directly into the application code, decoupling the service from systemd’s startup timing and enhancing its robustness across different environments.
Tired of your service failing on boot because a dependency isn’t ready? Here’s how to properly fix systemd startup order issues, from the quick hack to the permanent architectural solution.
Your Service Died on Reboot… Again. Let’s Talk systemd Dependencies.
It’s 2 AM. PagerDuty is screaming because the user-auth-service is down after a routine kernel patch and reboot on prod-app-01. I SSH in, heart pounding, run systemctl status user-auth-service, and see the dreaded ‘failed’ state. Annoyed, I run systemctl start user-auth-service and… it starts perfectly. The sinking feeling hits me: it’s a race condition. The service tried to connect to PostgreSQL on prod-db-01 before the database was finished initializing. We’ve all been there, and frankly, it’s an amateur-hour problem we need to solve for good.
The “Why”: systemd Doesn’t Know Your App is Ready
Before we dive into fixes, you have to understand the root cause. By default, systemd is built for speed. It tries to start as many services as possible in parallel. When you add After=postgresql.service and Requires=postgresql.service to your unit file, you’re telling systemd, “Hey, don’t start my service until you’ve started the command for PostgreSQL.”
Here’s the catch: for a unit of Type=simple (what you get when no Type= is set), systemd treats the service as started the moment its process is launched, marks it as “active,” and immediately moves on to start your service. Unless the dependency’s unit reports readiness itself (for example, with Type=notify), systemd has no idea that PostgreSQL is still in the middle of loading its configuration, replaying WAL files, and actually opening port 5432 to accept connections. Your app comes online, tries to connect, gets a “connection refused,” and promptly dies.
The problem isn’t that systemd is broken; it’s that we’re not telling it the whole story. We need to tell it to wait until the dependency is *actually usable*.
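What counts as “started” depends on the Type= setting of the dependency’s unit, so it can be worth checking what the PostgreSQL unit on your host actually declares. Here is a minimal sketch, assuming systemd 230 or newer (for the --value flag). The helper names unit_start_type and describe_start_type are made up for illustration, and the descriptions paraphrase systemd.service(5).

```bash
#!/usr/bin/env bash
# Hypothetical helpers, not part of systemd itself.

# unit_start_type UNIT: print the Type= that systemd has loaded for UNIT.
unit_start_type() {
    systemctl show --property=Type --value "$1"
}

# describe_start_type TYPE: print when systemd considers such a unit "started".
describe_start_type() {
    case "$1" in
        simple)  echo "started as soon as the main process is forked" ;;
        exec)    echo "started once the service binary has been executed" ;;
        forking) echo "started once the initial process exits" ;;
        notify)  echo "started once the service sends READY=1" ;;
        *)       echo "see systemd.service(5)" ;;
    esac
}
```

On a real host you would run something like describe_start_type "$(unit_start_type postgresql.service)". Only a readiness-aware type such as notify gives After= anything meaningful to wait for, and even then only for a database running on the same machine.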
Solution 1: The Quick & Dirty Fix (The “Sleep” Hack)
This is the first thing everyone tries. You’re in a panic, you need the service up, and you don’t have time to re-architect anything. You just add a delay to your service’s startup.
The How-To:
You modify the [Service] section of your unit file and add an ExecStartPre directive with a sleep command.
[Unit]
Description=User Authentication Service
Requires=postgresql.service
After=postgresql.service
[Service]
ExecStartPre=/bin/sleep 15
ExecStart=/usr/bin/user-auth-service
User=appuser
[Install]
WantedBy=multi-user.target
The Verdict:
I’ll be blunt: this is a terrible permanent solution. It’s a hack. It’s basically guessing how long the database *might* take to start. What if the DB server is under heavy load and takes 20 seconds next time? Your service fails again. Plus, you’ve just added a permanent 15-second delay to every single boot sequence, even when the DB is already up. But at 2 AM? It’ll get you back to bed. Use it, then promise me you’ll fix it properly in the morning.
Solution 2: The ‘Proper’ systemd Fix (The Polling Script)
This is the responsible, robust way to solve the problem within systemd. Instead of guessing with a blind sleep, we’ll actively check if the dependency is ready. We’ll write a small script that polls the database and only allows our service to start once a connection succeeds.
The How-To:
First, create a small shell script. Let’s call it /usr/local/bin/wait-for-postgres.sh.
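#!/bin/bash
# wait-for-postgres.sh: block until Postgres accepts connections, then run the given command.
set -e

host="prod-db-01"
port="5432"

# Poll with a trivial psql session; DB_PASSWORD is expected in the environment.
until PGPASSWORD="$DB_PASSWORD" psql -h "$host" -p "$port" -U "postgres" -c '\q'; do
    >&2 echo "Postgres is unavailable - sleeping"
    sleep 1
done

>&2 echo "Postgres is up - executing command"
# exec "$@" keeps every argument intact; an unquoted $cmd would be re-split on whitespace.
exec "$@"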
Pro Tip: Using pg_isready is another great option here if you don’t want to deal with passwords. The core concept is the same: poll until you get a success signal.
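For illustration, here is roughly what a pg_isready-based wait could look like. This is a sketch, not a drop-in replacement: the function name wait_for_postgres and its max-wait argument are assumptions added here, so that a database that never comes up does not leave the loop spinning forever. The -q, -h, and -p flags are standard pg_isready options, and the command exits 0 only when the server is accepting connections.

```bash
#!/usr/bin/env bash
# wait_for_postgres HOST PORT MAX_WAIT_SECONDS  (hypothetical helper)
# Polls pg_isready about once per second. Returns 0 as soon as the server
# accepts connections, or 1 if MAX_WAIT_SECONDS polls pass without success.
wait_for_postgres() {
    local host="$1"
    local port="$2"
    local max_wait="$3"
    local waited=0

    until pg_isready -q -h "$host" -p "$port"; do
        if [ "$waited" -ge "$max_wait" ]; then
            echo "Postgres at ${host}:${port} still unavailable after ${max_wait}s" >&2
            return 1
        fi
        echo "Postgres is unavailable - sleeping" >&2
        sleep 1
        waited=$((waited + 1))
    done

    echo "Postgres is up" >&2
}
```

Inside the wrapper script, the until loop would then become something like wait_for_postgres "$host" "$port" 60 followed by exec "$@", where the 60-second budget is an arbitrary example value.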
Make sure to chmod +x /usr/local/bin/wait-for-postgres.sh. Now, you modify your systemd unit file to use this script to wrap your main executable.
[Unit]
Description=User Authentication Service
Requires=postgresql.service
After=postgresql.service
[Service]
# The script now handles waiting and executing
ExecStart=/usr/local/bin/wait-for-postgres.sh /usr/bin/user-auth-service
User=appuser
# You might need to pass the DB password via an EnvironmentFile
EnvironmentFile=/etc/default/user-auth-service
[Install]
WantedBy=multi-user.target
The Verdict:
This is a solid, reliable fix. It ensures your service only starts when its dependency is truly ready, without adding unnecessary delays. It keeps the resilience logic at the infrastructure layer, which is a classic and perfectly valid DevOps approach.
Solution 3: The ‘Architectural’ Fix (The “It’s Not My Problem” Option)
This is where you put your Lead Architect hat on. Why should the infrastructure be responsible for an application’s inability to handle a temporary dependency failure? Modern, cloud-native applications should be resilient. They shouldn’t just crash and burn if a database isn’t available for the first few seconds of their life.
The How-To:
The fix is to push for a change in the application code itself. The application should:
- Attempt to connect to the database on startup.
- If the connection fails, it should not exit.
- Instead, it should enter a retry loop, preferably with exponential backoff (e.g., wait 1s, then 2s, then 4s, etc.).
- It should log clearly that it is waiting for the database connection.
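The steps above can be sketched as follows. The application’s language isn’t stated, so this uses bash purely to show the control flow; a real service would implement the same loop around its own database client. The function name retry_with_backoff, its arguments, the attempt limit, and the 30-second cap are all illustrative assumptions.

```bash
#!/usr/bin/env bash
# retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds, doubling the wait after each failure.
# Returns 0 on success, 1 once MAX_ATTEMPTS attempts have all failed.
retry_with_backoff() {
    local max_attempts="$1"
    local delay="$2"
    local attempt=1
    local max_delay=30   # assumed cap so the wait never grows without bound
    shift 2

    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after ${attempt} attempts" >&2
            return 1
        fi
        # Log clearly that we are waiting rather than exiting.
        echo "attempt ${attempt} failed; waiting for the database, retrying in ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
        delay=$((delay * 2))
        if [ "$delay" -gt "$max_delay" ]; then
            delay="$max_delay"
        fi
    done
}
```

Called as retry_with_backoff 8 1 pg_isready -q -h prod-db-01 -p 5432, this waits 1s, then 2s, then 4s, and so on up to the cap. Bounding the number of attempts is a design choice: after a sustained outage the service still exits with an error, so the failure stays visible instead of hanging silently.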
The Verdict:
This is the best solution. It makes the application inherently more robust and decouples it from systemd’s startup timing entirely. The service also becomes more portable: it will work just as well in a Docker container, in Kubernetes, on a VM, or on bare metal without any special scripts. This moves the responsibility from the Ops side of DevOps to the Dev side, creating a more resilient system overall.
Darian’s Take: Don’t be afraid to push back on development teams for this. It might seem like more work up front, but building retry logic into the application is a fundamental tenet of distributed systems. Fixing it at the source is always better than patching over it with infrastructure hacks.
| Solution | Pros | Cons |
|---|---|---|
| 1. Sleep Hack | Extremely simple; fast to implement in an emergency. | Unreliable; slows down boot time; brittle. |
| 2. Polling Script | Very reliable; solves the problem at the infrastructure layer. | Adds another script to maintain; couples the service to the host OS. |
| 3. Architectural Fix | Most resilient; technology agnostic; correct long-term solution. | Requires application code changes; can face pushback from developers. |
So next time you get that 2 AM page, take a deep breath. Get the system back up with the quick fix if you must, but come morning, sit down with your team and implement a real, permanent solution. Your future self will thank you.
🤖 Frequently Asked Questions
❓ Why does my systemd service fail on boot even with `After=` and `Requires=` directives for its database dependency?
Systemd’s `After=` and `Requires=` directives only ensure the dependency’s process has been launched, not that it’s fully initialized and ready to accept connections. Your service might attempt to connect prematurely, resulting in a ‘connection refused’ error and failure.
❓ What are the trade-offs between using a `sleep` hack, a polling script, and in-application retry logic for systemd dependency management?
The `sleep` hack is quick but unreliable and adds unnecessary, fixed delays. A polling script is reliable at the infrastructure layer but adds another script to maintain. In-application retry logic is the most resilient and portable, making the application technology-agnostic, but requires code changes.
❓ What is a common implementation pitfall when ensuring systemd service dependencies are ready, and how can it be avoided?
A common pitfall is assuming `After=` and `Requires=` guarantee a dependency is fully usable. This can be avoided by using a polling script (like `wait-for-postgres.sh`) in `ExecStart` to actively check dependency readiness, or ideally, by implementing retry logic with exponential backoff directly within the application.