🚀 Executive Summary

TL;DR: Manually sifting through Docker logs for crash-looping containers is inefficient and reactive. This guide provides a Python script to proactively monitor specific Docker containers for frequent restarts within a defined time window, sending immediate alerts to prevent outages.

🎯 Key Takeaways

  • The solution uses a Python script built on the Docker SDK for Python (the `docker` package) to inspect container state, specifically the `FinishedAt` timestamp and the container’s restart count (`RestartCount`).
  • A local `restart_timestamps.json` state file is used to persist and prune historical restart events, enabling the script to accurately track restart frequency over a `TIME_WINDOW_MINUTES`.
  • Configuration parameters like `CONTAINER_NAME`, `SLACK_WEBHOOK_URL`, `RESTART_THRESHOLD`, and `TIME_WINDOW_MINUTES` are kept out of the source code in a `config.env` file, which makes customization easy and separates configuration from logic.

Alert when a specific Docker Container Restarts too frequently

Hey everyone, Darian here. Let’s talk about a silent killer of productivity: manual log checks. I used to spend a good chunk of my mornings sifting through Docker logs, looking for containers caught in a “crash loop.” It was a reactive, time-consuming process. Once I built a simple monitoring script for this, I got those hours back and, more importantly, I started learning about problems *before* they escalated into outages.

This quick guide will show you how to set up a Python script to do just that. It’s a simple, effective way to get proactive alerts when a service is misbehaving. Let’s dive in.

Prerequisites

  • Python 3 installed on your server.
  • Docker running on the same machine where the script will execute.
  • A notification endpoint, like a Slack Webhook URL. We’ll use Slack in this example, but you could easily adapt it for email, Teams, or anything with an API.

The Guide: Step-by-Step

Step 1: Prepare Your Environment

First, you’ll want to set up a dedicated directory for your project. I won’t walk through the standard virtual environment setup, as you probably have your own preferred workflow for managing Python projects. The key is to get a clean space to work.

You’ll need a few Python libraries, which you can install using pip. In your terminal, you’d run something like pip install docker python-dotenv requests. (The Docker SDK for Python is published on PyPI under the name docker.) This gives us the tools to talk to Docker, manage our configuration, and send web requests for our alerts.

Inside your project directory, create two files: a configuration file named config.env and your Python script, which we’ll call monitor_restarts.py.

Step 2: The Configuration File

Your config.env file is where we’ll store all the variables. This keeps sensitive information like API keys out of your source code. It’s a simple key-value format.

# The exact name of the container you want to monitor
CONTAINER_NAME=your-app-container-name

# Your Slack incoming webhook URL
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX

# The number of restarts that should trigger an alert
RESTART_THRESHOLD=5

# The time window in minutes to check for restarts
TIME_WINDOW_MINUTES=60
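
A typo in config.env would otherwise surface only as a confusing runtime error, so it can help to validate the values once at startup. The following is a minimal sketch rather than part of the original script: validate_config is a hypothetical helper, and it takes a plain dictionary so it can be tried without a real config.env.

```python
def validate_config(env):
    """Return a cleaned config dict or raise ValueError on bad input.

    `env` is any mapping of setting names to strings, for example
    os.environ after load_dotenv('config.env') has run. (Hypothetical helper.)
    """
    name = (env.get("CONTAINER_NAME") or "").strip()
    if not name:
        raise ValueError("CONTAINER_NAME must be set")

    config = {
        "CONTAINER_NAME": name,
        # An empty webhook URL is allowed: the script then prints alerts instead.
        "SLACK_WEBHOOK_URL": (env.get("SLACK_WEBHOOK_URL") or "").strip(),
    }

    # Same defaults as the script: 5 restarts within 60 minutes.
    for key, default in (("RESTART_THRESHOLD", 5), ("TIME_WINDOW_MINUTES", 60)):
        raw = env.get(key)
        if raw in (None, ""):
            value = default
        else:
            try:
                value = int(raw)  # int() raises ValueError on non-numeric text
            except ValueError:
                raise ValueError(f"{key} must be a whole number, got {raw!r}") from None
        if value < 1:
            raise ValueError(f"{key} must be at least 1, got {value}")
        config[key] = value

    return config
```

Calling validate_config(os.environ) right after load_dotenv would stop the script early with a readable message instead of failing later inside the check.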

Step 3: The Python Monitoring Script

Now for the core logic. The script will connect to the Docker daemon, inspect a specific container, and check its restart count. The tricky part is knowing if the restarts are *frequent*. For that, we’ll maintain a small state file to log the timestamps of recent restarts.
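
The windowing idea can be sketched on its own, separate from Docker. This is a simplified illustration with made-up data and a hypothetical helper name (prune_to_window is not a function in the script): keep only the restart times that fall inside the window, then compare how many remain against the threshold.

```python
from datetime import datetime, timedelta, timezone


def prune_to_window(restart_times, now, window):
    """Keep only the restart times no older than `window` relative to `now`."""
    return [t for t in restart_times if now - t <= window]


# Illustrative data: three restarts, one of them well outside a 60-minute window.
now = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
restart_times = [
    now - timedelta(minutes=90),
    now - timedelta(minutes=20),
    now - timedelta(minutes=5),
]

recent = prune_to_window(restart_times, now, timedelta(minutes=60))
threshold = 2
should_alert = len(recent) >= threshold
print(len(recent), should_alert)  # prints: 2 True
```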

Here is the complete code for monitor_restarts.py. I’ll break down how it works right after.

import docker
import os
import json
import requests
from datetime import datetime, timedelta, timezone
from dotenv import load_dotenv

# --- Configuration ---
load_dotenv('config.env')
CONTAINER_NAME = os.getenv('CONTAINER_NAME')
SLACK_WEBHOOK_URL = os.getenv('SLACK_WEBHOOK_URL')
RESTART_THRESHOLD = int(os.getenv('RESTART_THRESHOLD', 5))
TIME_WINDOW_MINUTES = int(os.getenv('TIME_WINDOW_MINUTES', 60))

STATE_FILE = 'restart_timestamps.json'

# --- Helper Functions ---
def load_timestamps():
    """Loads restart timestamps from the state file."""
    if not os.path.exists(STATE_FILE):
        return []
    with open(STATE_FILE, 'r') as f:
        try:
            return json.load(f)
        except json.JSONDecodeError:
            return []

def save_timestamps(timestamps):
    """Saves restart timestamps to the state file."""
    with open(STATE_FILE, 'w') as f:
        json.dump(timestamps, f)

def send_slack_alert(message):
    """Sends a formatted message to a Slack webhook."""
    if not SLACK_WEBHOOK_URL:
        print("ALERT (Slack URL not configured):", message)
        return
    try:
        payload = {'text': message}
        requests.post(SLACK_WEBHOOK_URL, json=payload, timeout=10)
        print("Slack alert sent successfully.")
    except requests.RequestException as e:
        print(f"Error sending Slack alert: {e}")

# --- Main Logic ---
def check_container_restarts():
    """Main function to check container restarts and send alerts."""
    print(f"[{datetime.now()}] Running check for container: {CONTAINER_NAME}")

    try:
        client = docker.from_env()
        container = client.containers.get(CONTAINER_NAME)
    except docker.errors.NotFound:
        print(f"Error: Container '{CONTAINER_NAME}' not found.")
        return
    except docker.errors.DockerException as e:
        print(f"Error connecting to Docker daemon: {e}")
        return

    # 'FinishedAt' records only the *most recent* exit, so each run can
    # observe at most one new restart event.
    finished_at_str = container.attrs['State']['FinishedAt']

    # Docker reports UTC timestamps ending in 'Z' with up to nanosecond
    # precision; datetime only stores microseconds, so keep six digits.
    try:
        base, _, fraction = finished_at_str.rstrip('Z').partition('.')
        micros = fraction[:6].ljust(6, '0')
        finished_at = datetime.strptime(
            f"{base}.{micros}", '%Y-%m-%dT%H:%M:%S.%f'
        ).replace(tzinfo=timezone.utc)  # keep the value timezone-aware
    except (ValueError, TypeError, AttributeError):
        print("Could not parse the container's finish time. No action needed.")
        return

    # A container that has never exited reports the zero time '0001-01-01T00:00:00Z'.
    if finished_at.year == 1:
        print("Container has never exited. No action needed.")
        return

    timestamps = load_timestamps()
    
    # Read the restart counter from the inspect data (container.attrs).
    restart_count = container.attrs.get('RestartCount', 0)

    # Check if this restart event is new
    if finished_at.isoformat() not in [ts['time'] for ts in timestamps]:
        print(f"New restart detected at: {finished_at.isoformat()}")
        timestamps.append({'time': finished_at.isoformat(), 'count': restart_count})

    # Prune old timestamps that are outside our time window
    now = datetime.now(timezone.utc)
    time_window = timedelta(minutes=TIME_WINDOW_MINUTES)

    recent_restarts = [
        ts for ts in timestamps
        if now - datetime.fromisoformat(ts['time']) <= time_window
    ]

    save_timestamps(recent_restarts)

    # Check if we need to alert
    if len(recent_restarts) >= RESTART_THRESHOLD:
        message = (
            f":warning: *ALERT:* Container `{CONTAINER_NAME}` has restarted "
            f"`{len(recent_restarts)}` times in the last `{TIME_WINDOW_MINUTES}` minutes. "
            f"Total restarts: `{restart_count}`."
        )
        send_slack_alert(message)
        # To avoid spamming, we clear the log after an alert.
        # In a more advanced setup, you'd implement alert silencing.
        save_timestamps([]) 
    else:
        print(f"Found {len(recent_restarts)} recent restarts. Threshold is {RESTART_THRESHOLD}. No alert.")


if __name__ == "__main__":
    check_container_restarts()

Pro Tip: The script itself holds no state between runs; everything it needs lives in the restart_timestamps.json file. This makes it resilient: if the script misses a few runs, the next execution picks up from the state file and the container’s current status from Docker. One caveat: FinishedAt records only the most recent exit, so several restarts that happen between two runs are counted as a single event.
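
One possible refinement, since a single FinishedAt value cannot show several restarts between two runs, is to compare Docker’s RestartCount field from one run to the next. The sketch below is an illustration, not part of the original script: restarts_since_last_run is a hypothetical helper, and it assumes the previous count was saved in the state file and that the counter can drop back toward zero (for example when the container is recreated).

```python
def restarts_since_last_run(previous_count, current_count):
    """Estimate how many restarts happened between two script runs.

    Both arguments are RestartCount values as reported by `docker inspect`
    (container.attrs['RestartCount'] in the Docker SDK). Hypothetical helper.
    """
    if previous_count is None:
        # First run: there is no baseline yet, so report nothing new.
        return 0
    if current_count < previous_count:
        # The counter went backwards, which suggests it was reset
        # (assumption: the container was recreated). Count from zero.
        return current_count
    return current_count - previous_count
```

Each run could then record that many restart events instead of at most one.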

Step 4: Schedule the Script

This script is designed to be run periodically. A cron job is perfect for this. We want to check frequently enough to catch problems early, so running it every minute is a good starting point.

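You can set up a cron job that looks like this. Cron does not start in your project directory, and the script reads config.env and restart_timestamps.json by relative path, so change into that directory first (replace /path/to/project with your own location).

* * * * * cd /path/to/project && python3 monitor_restarts.py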
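
Since cron does not run jobs from your project directory, another option is to have the script locate its files relative to its own location rather than the current working directory. This is a minimal sketch, assuming config.env and the state file live next to the script; project_path is a hypothetical helper.

```python
from pathlib import Path


def project_path(script_file, filename):
    """Return the absolute path of `filename` in the same directory as `script_file`."""
    return Path(script_file).resolve().parent / filename


# In monitor_restarts.py this could be used as, for example:
#   load_dotenv(project_path(__file__, 'config.env'))
#   STATE_FILE = project_path(__file__, 'restart_timestamps.json')
```

With paths resolved this way, the cron entry can call the script by its full path from any working directory.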

Common Pitfalls

Here are a few spots where I’ve tripped up in the past.

  • Docker Socket Permissions: By far the most common issue. The user running the script needs permission to communicate with the Docker daemon. If you see a “permission denied” error related to docker.sock, the easiest fix is to add the user to the docker group. Keep in mind that membership in that group is effectively root-level access to the host.
  • Incorrect Container Name: Double-check the CONTAINER_NAME in your config.env file. It must be the exact name of the running container, not the image name. Use docker ps to verify.
  • State File Permissions: The script needs to be able to read and write restart_timestamps.json. Make sure the user running the cron job has the correct file permissions in the project directory.
  • Timezone Awareness: Docker’s API provides timestamps in UTC. My script normalizes this, but it’s something to be aware of if you start modifying the logic. Always handle time in a consistent timezone (UTC is my standard) to avoid painful off-by-one-hour bugs.
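
To make the timezone point concrete, here is one way to turn a Docker timestamp into a timezone-aware datetime. It is a sketch under stated assumptions: parse_docker_timestamp is a hypothetical helper, and it assumes the value is in UTC with a trailing Z and optional fractional seconds of up to nanosecond precision, which is finer than Python’s microsecond resolution.

```python
from datetime import datetime, timezone


def parse_docker_timestamp(value):
    """Parse e.g. '2024-05-01T12:00:00.123456789Z' into an aware UTC datetime.

    Raises ValueError if the value does not match the assumed format.
    """
    if not value.endswith("Z"):
        raise ValueError(f"expected a UTC timestamp ending in 'Z': {value!r}")
    base, _, fraction = value[:-1].partition(".")
    # datetime stores microseconds, so keep six digits (padding short fractions).
    micros = fraction[:6].ljust(6, "0")
    parsed = datetime.strptime(f"{base}.{micros}", "%Y-%m-%dT%H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)


ts = parse_docker_timestamp("2024-05-01T12:00:00.123456789Z")
print(ts.isoformat())  # 2024-05-01T12:00:00.123456+00:00
```

Because the result carries tzinfo, it can be subtracted from datetime.now(timezone.utc) directly; mixing it with a naive datetime raises a TypeError instead of silently producing a wrong interval.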

Conclusion

And that’s it. You now have a robust, lightweight monitoring system for your critical containers. It’s a “set it and forget it” solution that brings real peace of mind. In my production setups, I expand on this with more sophisticated alert silencing and logging, but this script is the foundation I build upon. It’s about turning unknown unknowns into known, actionable alerts.
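
As a pointer toward that alert silencing, here is a minimal cooldown sketch. It is an illustration rather than the production logic mentioned above: should_send_alert and the 30-minute default are hypothetical, and it assumes the time of the last alert is stored alongside the restart timestamps.

```python
from datetime import datetime, timedelta, timezone


def should_send_alert(last_alert_at, now, cooldown=timedelta(minutes=30)):
    """Return True if no alert was sent yet or the cooldown has elapsed.

    `last_alert_at` is an aware datetime, or None if no alert has been sent.
    """
    if last_alert_at is None:
        return True
    return now - last_alert_at >= cooldown


now = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
print(should_send_alert(None, now))                         # True
print(should_send_alert(now - timedelta(minutes=10), now))  # False
print(should_send_alert(now - timedelta(minutes=45), now))  # True
```

With a check like this in place, the script would not need to clear the restart log after each alert to avoid repeats.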

Happy monitoring!

Darian Vance - Lead Cloud Architect & DevOps Strategist

With over 12 years in system architecture and automation, Darian specializes in simplifying complex cloud infrastructures. An advocate for open-source solutions, he founded TechResolve to provide engineers with actionable, battle-tested troubleshooting guides and robust software alternatives.


🤖 Frequently Asked Questions

❓ How does the script determine if a Docker container is restarting too frequently?

The script monitors the `FinishedAt` timestamp of a specified container. It stores recent restart timestamps in a `restart_timestamps.json` state file and counts how many restarts occurred within the `TIME_WINDOW_MINUTES`. If this count meets or exceeds the `RESTART_THRESHOLD`, an alert is triggered.

❓ How does this compare to alternatives for Docker container restart monitoring?

This custom Python script offers a lightweight, self-hosted solution for proactive, specific container restart alerting, providing immediate notifications. Unlike Docker’s built-in restart policies which focus on ensuring uptime, this script focuses on alerting when a container is misbehaving. It’s a simpler alternative to complex external monitoring tools, offering direct control and customization.

❓ What are common implementation pitfalls when setting up this Docker restart monitoring script?

Common pitfalls include `docker.sock` permission issues (the user running the script needs access to the Docker daemon), an incorrect `CONTAINER_NAME` in the `config.env` file, insufficient file permissions for the `restart_timestamps.json` state file, and potential timezone awareness problems when parsing Docker’s UTC timestamps if not handled consistently.
