🚀 Executive Summary
TL;DR: Manually sifting through Docker logs for crash-looping containers is inefficient and reactive. This guide provides a Python script to proactively monitor specific Docker containers for frequent restarts within a defined time window, sending immediate alerts to prevent outages.
🎯 Key Takeaways
- The solution uses a Python script built on the Docker SDK for Python (the `docker` package on PyPI) to inspect container state, specifically the `FinishedAt` timestamp and the `RestartCount` value.
- A local `restart_timestamps.json` state file is used to persist and prune historical restart events, enabling the script to accurately track restart frequency over a `TIME_WINDOW_MINUTES`.
- Configuration parameters like `CONTAINER_NAME`, `SLACK_WEBHOOK_URL`, `RESTART_THRESHOLD`, and `TIME_WINDOW_MINUTES` live in a `config.env` file rather than in the source code, which keeps the webhook URL out of the script and makes the values easy to change.
Alert when a specific Docker Container Restarts too frequently
Hey everyone, Darian here. Let’s talk about a silent killer of productivity: manual log checks. I used to spend a good chunk of my mornings sifting through Docker logs, looking for containers caught in a “crash loop.” It was a reactive, time-consuming process. Once I built a simple monitoring script for this, I got those hours back and, more importantly, I started learning about problems *before* they escalated into outages.
This quick guide will show you how to set up a Python script to do just that. It’s a simple, effective way to get proactive alerts when a service is misbehaving. Let’s dive in.
Prerequisites
- Python 3 installed on your server.
- Docker running on the same machine where the script will execute.
- A notification endpoint, like a Slack Webhook URL. We’ll use Slack in this example, but you could easily adapt it for email, Teams, or anything with an API.
The Guide: Step-by-Step
Step 1: Prepare Your Environment
First, you’ll want to set up a dedicated directory for your project. I won’t walk through the standard virtual environment setup, as you probably have your own preferred workflow for managing Python projects. The key is to get a clean space to work.
You’ll need a few Python libraries, which you can install with pip. In your terminal, you’d run something like `pip install docker python-dotenv requests`. Note that the Docker SDK for Python is published on PyPI as `docker`. This gives us the tools to talk to Docker, manage our configuration, and send web requests for our alerts.
Inside your project directory, create two files: a configuration file named config.env and your Python script, which we’ll call monitor_restarts.py.
Step 2: The Configuration File
Your config.env file is where we’ll store all the variables. This keeps sensitive information like API keys out of your source code. It’s a simple key-value format.
# The exact name of the container you want to monitor
CONTAINER_NAME=your-app-container-name
# Your Slack incoming webhook URL
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX
# The number of restarts that should trigger an alert
RESTART_THRESHOLD=5
# The time window in minutes to check for restarts
TIME_WINDOW_MINUTES=60
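Every value in config.env arrives as a string, so it can help to validate the settings once at startup. The following is a minimal sketch and not part of the original script: the `Settings` dataclass and the `load_settings` helper are hypothetical names, and the function reads from a plain mapping so it can be tried without a real file.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    container_name: str
    slack_webhook_url: Optional[str]
    restart_threshold: int
    time_window_minutes: int


def load_settings(env: Mapping[str, str]) -> Settings:
    """Build validated settings from a mapping such as os.environ."""
    name = env.get("CONTAINER_NAME", "").strip()
    if not name:
        raise ValueError("CONTAINER_NAME must be set")

    # int() raises ValueError on non-numeric input, which is what we want here.
    threshold = int(env.get("RESTART_THRESHOLD", "5"))
    window = int(env.get("TIME_WINDOW_MINUTES", "60"))
    if threshold < 1 or window < 1:
        raise ValueError("RESTART_THRESHOLD and TIME_WINDOW_MINUTES must be positive")

    # An empty webhook URL is treated as "not configured".
    webhook = env.get("SLACK_WEBHOOK_URL", "").strip() or None
    return Settings(name, webhook, threshold, window)
```

In the monitoring script this could be called as `load_settings(os.environ)` right after `load_dotenv('config.env')`, so a typo in the file fails loudly instead of silently falling back to a default.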
Step 3: The Python Monitoring Script
Now for the core logic. The script will connect to the Docker daemon, inspect a specific container, and check its restart count. The tricky part is knowing if the restarts are *frequent*. For that, we’ll maintain a small state file to log the timestamps of recent restarts.
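To make the "frequent restarts" idea concrete, here is a small sketch of the sliding-window check as pure functions. The names `restarts_in_window` and `should_alert` are illustrative only, and the timestamps are assumed to be timezone-aware ISO 8601 strings with an explicit `+00:00` offset.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def restarts_in_window(timestamps: List[str], now: datetime, window_minutes: int) -> List[str]:
    """Keep only the ISO 8601 timestamps that fall inside the window ending at `now`."""
    window = timedelta(minutes=window_minutes)
    return [ts for ts in timestamps if now - datetime.fromisoformat(ts) <= window]


def should_alert(timestamps: List[str], now: datetime, window_minutes: int, threshold: int) -> bool:
    """True when the number of restarts inside the window reaches the threshold."""
    return len(restarts_in_window(timestamps, now, window_minutes)) >= threshold


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    stamps = [
        "2024-01-01T10:30:00+00:00",  # 90 minutes ago, outside a 60-minute window
        "2024-01-01T11:15:00+00:00",  # 45 minutes ago
        "2024-01-01T11:59:00+00:00",  # 1 minute ago
    ]
    print(restarts_in_window(stamps, now, 60))
    print(should_alert(stamps, now, 60, threshold=2))
```

Keeping this logic free of Docker and file access makes it easy to test with fixed timestamps.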
Here is the complete code for monitor_restarts.py. I’ll break down how it works right after.
import docker
import os
import json
import requests
from datetime import datetime, timedelta, timezone
from dotenv import load_dotenv

# --- Configuration ---
load_dotenv('config.env')
CONTAINER_NAME = os.getenv('CONTAINER_NAME')
SLACK_WEBHOOK_URL = os.getenv('SLACK_WEBHOOK_URL')
RESTART_THRESHOLD = int(os.getenv('RESTART_THRESHOLD', 5))
TIME_WINDOW_MINUTES = int(os.getenv('TIME_WINDOW_MINUTES', 60))
STATE_FILE = 'restart_timestamps.json'


# --- Helper Functions ---
def load_timestamps():
    """Loads restart timestamps from the state file."""
    if not os.path.exists(STATE_FILE):
        return []
    with open(STATE_FILE, 'r') as f:
        try:
            return json.load(f)
        except json.JSONDecodeError:
            # A corrupt or empty state file is treated as "no history"
            return []


def save_timestamps(timestamps):
    """Saves restart timestamps to the state file."""
    with open(STATE_FILE, 'w') as f:
        json.dump(timestamps, f)


def send_slack_alert(message):
    """Sends a formatted message to a Slack webhook."""
    if not SLACK_WEBHOOK_URL:
        print("ALERT (Slack URL not configured):", message)
        return
    try:
        payload = {'text': message}
        response = requests.post(SLACK_WEBHOOK_URL, json=payload, timeout=10)
        # Treat HTTP error responses as failures instead of reporting success
        response.raise_for_status()
        print("Slack alert sent successfully.")
    except requests.RequestException as e:
        print(f"Error sending Slack alert: {e}")


# --- Main Logic ---
def check_container_restarts():
    """Main function to check container restarts and send alerts."""
    print(f"[{datetime.now()}] Running check for container: {CONTAINER_NAME}")
    try:
        client = docker.from_env()
        container = client.containers.get(CONTAINER_NAME)
    except docker.errors.NotFound:
        print(f"Error: Container '{CONTAINER_NAME}' not found.")
        return
    except docker.errors.DockerException as e:
        print(f"Error connecting to Docker daemon: {e}")
        return

    # 'FinishedAt' records only the *last* exit, so each new value marks one restart event.
    finished_at_str = container.attrs['State']['FinishedAt']
    # The cumulative restart counter lives in the inspect data, not on the Container object.
    restart_count = container.attrs.get('RestartCount', 0)

    try:
        # Docker reports UTC with a trailing 'Z' and up to nanosecond precision.
        # Strip the 'Z', normalize the fraction to 6 digits (microseconds), then tag as UTC.
        finished_at_str = finished_at_str.rstrip('Z')
        if '.' in finished_at_str:
            head, frac = finished_at_str.split('.', 1)
            finished_at_str = f"{head}.{frac[:6].ljust(6, '0')}"
        finished_at = datetime.fromisoformat(finished_at_str).replace(tzinfo=timezone.utc)
    except (ValueError, TypeError, AttributeError):
        print("Could not parse the container's finish time. No action needed.")
        return

    # Docker uses '0001-01-01T00:00:00Z' for a container that has never exited.
    if finished_at.year == 1:
        print("Container has never exited. No action needed.")
        return

    timestamps = load_timestamps()

    # Check if this restart event is new
    if finished_at.isoformat() not in [ts['time'] for ts in timestamps]:
        print(f"New restart detected at: {finished_at.isoformat()}")
        timestamps.append({'time': finished_at.isoformat(), 'count': restart_count})

    # Prune old timestamps that are outside our time window
    now = datetime.now(timezone.utc)
    time_window = timedelta(minutes=TIME_WINDOW_MINUTES)
    recent_restarts = [
        ts for ts in timestamps
        if now - datetime.fromisoformat(ts['time']) <= time_window
    ]
    save_timestamps(recent_restarts)

    # Check if we need to alert
    if len(recent_restarts) >= RESTART_THRESHOLD:
        message = (
            f":warning: *ALERT:* Container `{CONTAINER_NAME}` has restarted "
            f"`{len(recent_restarts)}` times in the last `{TIME_WINDOW_MINUTES}` minutes. "
            f"Total restarts: `{restart_count}`."
        )
        send_slack_alert(message)
        # To avoid spamming, we clear the log after an alert.
        # In a more advanced setup, you'd implement alert silencing.
        save_timestamps([])
    else:
        print(f"Found {len(recent_restarts)} recent restarts. Threshold is {RESTART_THRESHOLD}. No alert.")


if __name__ == "__main__":
    check_container_restarts()
Pro Tip: The script itself keeps nothing in memory between runs: all state lives in the `restart_timestamps.json` file. This makes it resilient. If the script fails to run for a few minutes, it picks up on the next execution by reading the state file and the container’s current status from Docker. One caveat: `FinishedAt` only reflects the most recent exit, so several restarts between two runs are recorded as a single event.
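To count every restart rather than only the latest exit, one option is to compare Docker's cumulative `RestartCount` with the value seen on the previous run. This is a hedged sketch of that alternative, not what the script above does: `new_restart_events` is a hypothetical helper, the state file would need to store the last seen count, and as far as I know `RestartCount` covers restarts performed by Docker's restart policy and starts over when a container is recreated.

```python
from datetime import datetime, timezone
from typing import List


def new_restart_events(previous_count: int, current_count: int, observed_at: datetime) -> List[str]:
    """Return one ISO timestamp per restart that happened since the last run.

    The exact restart times are unknown, so each new restart is stamped with
    the observation time, which is an approximation.
    """
    if current_count < previous_count:
        # The counter went backwards: assume the container was recreated
        # and treat the current value as counted from zero.
        previous_count = 0
    delta = current_count - previous_count
    return [observed_at.isoformat()] * delta


if __name__ == "__main__":
    seen_at = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    # Three restarts happened between two runs of the script.
    print(new_restart_events(previous_count=2, current_count=5, observed_at=seen_at))
```

The returned timestamps could be appended to the same list the sliding-window check already uses.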
Step 4: Schedule the Script
This script is designed to be run periodically. A cron job is perfect for this. We want to check frequently enough to catch problems early, so running it every minute is a good starting point.
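You can set up a cron job that looks like this. Cron does not start in your project directory, so change into the directory containing your script and config file first, or use absolute paths. Replace the placeholder path below with your own.
* * * * * cd /path/to/project && python3 monitor_restarts.py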
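Another way to avoid depending on cron's working directory is to resolve the config and state files relative to the script itself. This is a small sketch under that assumption; `project_paths` is a hypothetical helper, not part of the script above.

```python
from pathlib import Path
from typing import Tuple


def project_paths(script_file: str) -> Tuple[Path, Path]:
    """Return (config_path, state_path) located next to the given script file."""
    base_dir = Path(script_file).resolve().parent
    return base_dir / "config.env", base_dir / "restart_timestamps.json"


if __name__ == "__main__":
    config_path, state_path = project_paths(__file__)
    print(config_path)
    print(state_path)
```

In the monitoring script, the two results could replace the hard-coded `'config.env'` passed to `load_dotenv` and the `STATE_FILE` constant, so the script behaves the same no matter where it is launched from.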
Common Pitfalls
Here are a few spots where I’ve tripped up in the past.
- Docker Socket Permissions: By far the most common issue. The user running the script needs permission to communicate with the Docker daemon. If you see a “permission denied” error related to `docker.sock`, the easiest fix is to add the user to the `docker` group.
- Incorrect Container Name: Double-check the `CONTAINER_NAME` in your `config.env` file. It must be the exact name of the running container, not the image name. Use `docker ps` to verify.
- State File Permissions: The script needs to be able to read and write `restart_timestamps.json`. Make sure the user running the cron job has the correct file permissions in the project directory.
- Timezone Awareness: Docker’s API provides timestamps in UTC. My script normalizes this, but it’s something to be aware of if you start modifying the logic. Always handle time in a consistent timezone (UTC is my standard) to avoid painful off-by-one-hour bugs.
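If you do modify the time handling, it may help to isolate the parsing in one function. The sketch below assumes the daemon reports UTC timestamps with a trailing `Z` and up to nanosecond precision; `parse_docker_timestamp` is a hypothetical name, and it always returns a timezone-aware value so that later subtraction against `datetime.now(timezone.utc)` cannot mix naive and aware datetimes.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_docker_timestamp(value: str) -> Optional[datetime]:
    """Parse a Docker UTC timestamp such as '2024-01-01T12:00:00.123456789Z'.

    Returns None for Docker's zero value ('0001-01-01T00:00:00Z') or for input
    that does not look like a UTC timestamp.
    """
    if not isinstance(value, str) or not value.endswith("Z"):
        return None
    head, _, frac = value[:-1].partition(".")
    # datetime supports microseconds only, so pad or trim the fraction to 6 digits.
    frac = frac[:6].ljust(6, "0")
    try:
        parsed = datetime.fromisoformat(f"{head}.{frac}").replace(tzinfo=timezone.utc)
    except ValueError:
        return None
    # Year 1 is Docker's placeholder for "never exited".
    return None if parsed.year == 1 else parsed


if __name__ == "__main__":
    print(parse_docker_timestamp("2024-01-01T12:00:00.123456789Z"))
    print(parse_docker_timestamp("0001-01-01T00:00:00Z"))
```

Padding the fraction matters on older Python versions, where `datetime.fromisoformat` only accepts fractional seconds of exactly three or six digits.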
Conclusion
And that’s it. You now have a robust, lightweight monitoring system for your critical containers. It’s a “set it and forget it” solution that brings real peace of mind. In my production setups, I expand on this with more sophisticated alert silencing and logging, but this script is the foundation I build upon. It’s about turning unknown unknowns into known, actionable alerts.
Happy monitoring!
🤖 Frequently Asked Questions
❓ How does the script determine if a Docker container is restarting too frequently?
The script monitors the `FinishedAt` timestamp of a specified container. It stores recent restart timestamps in a `restart_timestamps.json` state file and counts how many restarts occurred within the `TIME_WINDOW_MINUTES`. If this count meets or exceeds the `RESTART_THRESHOLD`, an alert is triggered.
❓ How does this compare to alternatives for Docker container restart monitoring?
This custom Python script offers a lightweight, self-hosted solution for proactive, specific container restart alerting, providing immediate notifications. Unlike Docker’s built-in restart policies which focus on ensuring uptime, this script focuses on alerting when a container is misbehaving. It’s a simpler alternative to complex external monitoring tools, offering direct control and customization.
❓ What are common implementation pitfalls when setting up this Docker restart monitoring script?
Common pitfalls include `docker.sock` permission issues (the user running the script needs access to the Docker daemon), an incorrect `CONTAINER_NAME` in the `config.env` file, insufficient file permissions for the `restart_timestamps.json` state file, and potential timezone awareness problems when parsing Docker’s UTC timestamps if not handled consistently.