Solved: Alert on MySQL Replication Lag via Datadog Custom Metric

🚀 Executive Summary

TL;DR: Manually checking MySQL replication lag via SSH is tedious and prone to missing critical spikes. This guide provides an automated solution using a Python script to fetch `Seconds_Behind_Master` and send it as a custom metric to Datadog, enabling proactive alerting and peace of mind.

🎯 Key Takeaways

The core of the solution is a Python script that connects to MySQL, executes `SHOW SLAVE STATUS`, parses `Seconds_Behind_Master`, and uses the Datadog API client to submit this value as a `mysql.replication.seconds_behind_master` gauge metric.
Secure handling of credentials is achieved by loading database and Datadog API keys from a `config.env` file using `python-dotenv`, keeping sensitive information out of the script itself.
Implementing the principle of least privilege for the MySQL monitoring user (granting only `REPLICATION CLIENT`) and consistently adding tags (e.g., `environment:production`, `db_host`) to custom Datadog metrics are crucial for security and effective filtering and dashboarding.

Alert on MySQL Replication Lag via Datadog Custom Metric

Hey team, Darian Vance here. I wanted to share a quick win that’s saved me a ton of headaches. I used to spend a good chunk of my Monday mornings manually SSH’ing into our various MySQL replicas to run `SHOW SLAVE STATUS`. It was a tedious, reactive process. After a near-miss where lag spiked silently over a weekend, I knew there had to be a better way. This simple Datadog integration turned that manual chore into an automated, proactive alert. It’s saved me hours and, more importantly, given our team real peace of mind. Let’s get it set up.

Prerequisites

A running MySQL replica instance.
A dedicated MySQL user with `REPLICATION CLIENT` privileges.
Python 3 installed on a host that can reach your replica.
A Datadog account with access to generate an API key and an Application key.

The Guide: Step-by-Step

Step 1: The Python Script for Monitoring

First, let’s craft the heart of our monitor: a Python script. This script will connect to our MySQL replica, check the replication status, and then post the lag time as a custom metric to Datadog. I’ll call the file check_replication_lag.py.

The logic is straightforward:

Load our database and Datadog credentials securely from a configuration file.
Connect to the replica and execute `SHOW SLAVE STATUS`.
Parse the output to find the `Seconds_Behind_Master` value. This is our key metric.
Use the Datadog API client to send this value as a gauge metric named mysql.replication.seconds_behind_master.


import os
import mysql.connector
from datetime import datetime
from datadog_api_client import ApiClient, Configuration
from datadog_api_client.v2.api.metrics_api import MetricsApi
from datadog_api_client.v2.model.metric_intake_type import MetricIntakeType
from datadog_api_client.v2.model.metric_payload import MetricPayload
from datadog_api_client.v2.model.metric_point import MetricPoint
from datadog_api_client.v2.model.metric_series import MetricSeries
from dotenv import load_dotenv

def get_replication_lag():
    """Connects to MySQL replica and returns Seconds_Behind_Master."""
    try:
        db_conn = mysql.connector.connect(
            host=os.getenv("DB_HOST"),
            user=os.getenv("DB_USER"),
            password=os.getenv("DB_PASSWORD"),
            database=os.getenv("DB_NAME")
        )
        cursor = db_conn.cursor(dictionary=True)
        cursor.execute("SHOW SLAVE STATUS")
        status = cursor.fetchone()
        
        if status and "Seconds_Behind_Master" in status and status["Seconds_Behind_Master"] is not None:
            return int(status["Seconds_Behind_Master"])
        else:
            # Handle cases where replication is not running or status is empty
            print("Could not determine replication lag. Is the replica running?")
            return None
            
    except mysql.connector.Error as err:
        print(f"MySQL Connection Error: {err}")
        return None
    finally:
        if 'db_conn' in locals() and db_conn.is_connected():
            cursor.close()
            db_conn.close()

def send_metric_to_datadog(lag_seconds):
    """Sends the lag value to Datadog as a custom metric."""
    configuration = Configuration()
    with ApiClient(configuration) as api_client:
        api_instance = MetricsApi(api_client)
        body = MetricPayload(
            series=[
                MetricSeries(
                    metric="mysql.replication.seconds_behind_master",
                    type=MetricIntakeType.GAUGE,
                    points=[
                        MetricPoint(
                            timestamp=int(datetime.now().timestamp()),
                            value=float(lag_seconds),
                        ),
                    ],
                    tags=[
                        "environment:production",
                        f"db_host:{os.getenv('DB_HOST')}"
                    ],
                ),
            ]
        )

        try:
            api_instance.submit_metrics(body)
            print(f"Successfully submitted metric to Datadog: {lag_seconds}s lag.")
        except Exception as e:
            print(f"Error submitting metric to Datadog: {e}")

if __name__ == "__main__":
    load_dotenv('config.env') # Load credentials
    
    # Check for Datadog keys
    if not os.getenv("DD_API_KEY") or not os.getenv("DD_APP_KEY"):
        print("Datadog API or APP key not found in config.env. Cannot proceed.")
        # We use a return here to avoid forbidden commands
    else:
        lag = get_replication_lag()
        if lag is not None:
            send_metric_to_datadog(lag)
        else:
            print("Failed to retrieve replication lag. Nothing sent to Datadog.")

Pro Tip: In my production setups, I always create a dedicated, read-only MySQL user just for monitoring. Never use your application’s user or, worse, the root user. Grant it only the `REPLICATION CLIENT` privilege; it’s all this script needs and it follows the principle of least privilege.

Step 2: Handle Dependencies

I’ll skip the standard virtual environment setup since you likely have your own workflow for that. The key is to get the necessary Python libraries installed. You’ll need `mysql-connector-python` to talk to our database, `datadog-api-client` for the Datadog API, and `python-dotenv` to handle our configuration securely. A quick pip command to install those three will get you sorted.

Step 3: Configuration File

Let’s create a file named config.env in the same directory as our script. This keeps our secrets out of the code. Populate it with your credentials.


# MySQL Replica Credentials
DB_HOST=your-replica-db-hostname.rds.amazonaws.com
DB_USER=monitoring_user
DB_PASSWORD='your_secure_password'
DB_NAME=information_schema

# Datadog API and App Keys
DD_API_KEY='your_datadog_api_key_here'
DD_APP_KEY='your_datadog_app_key_here'
DD_SITE='datadoghq.com'

Step 4: Schedule the Script

We need this script to run on a regular cadence. For a Linux-based host, a simple cron job is perfect. I find running it every 5 minutes provides a good balance between timely data and API usage.

You can set up a cron job to execute the script like this:


*/5 * * * * python3 /path/to/your/script/check_replication_lag.py

Step 5: Create the Monitor in Datadog

With data flowing, the final step is to create the alert in the Datadog UI.

Navigate to Monitors > New Monitor.
Select Metric as the monitor type.
Define the metric: In the “from” field, type our custom metric name: mysql.replication.seconds_behind_master. You can filter it by the tags we added, like `db_host`.
Set alert conditions: I typically set two thresholds. A “Warning” if the average value is `above 300` (5 minutes) and a critical “Alert” if it’s `above 900` (15 minutes).
Configure notifications: Add your team’s Slack channel or PagerDuty service. Write a clear message like: `MySQL replication lag for {{db_host.name}} is {{value}} seconds. Immediate investigation required. @ops-team`
Save the monitor, and you’re all set!

Pro Tip: Always add tags to your custom metrics! My script includes `environment:production` and the `db_host`. This makes it incredibly easy to build dashboards or create alerts that only apply to a specific host or your production environment, filtering out noise from staging or dev.

Common Pitfalls (Where I Usually Mess Up)

Firewall Rules: This gets me every time. Before you spend an hour debugging the script, make sure the host running the Python code can actually reach your MySQL replica’s IP and port (usually 3306).
Incorrect MySQL Privileges: If your monitoring user doesn’t have the `REPLICATION CLIENT` privilege, `SHOW SLAVE STATUS` will return an empty result. The script will report that it can’t find the lag, which is true, but the root cause is a permissions issue.
Datadog API/APP Key Mix-up: It’s easy to paste the API key into the APP key field or vice versa. Double-check them. Also, ensure you’re using keys from the correct Datadog organization if you have multiple.

Conclusion

And that’s it. For the cost of a small Python script and a cron job, you’ve built a robust, automated monitoring solution for a critical piece of your infrastructure. This frees you up from mundane manual checks and ensures you’re the first to know when replication starts to fall behind. Hope this helps you out.

– Darian

Darian Vance

Lead Cloud Architect & DevOps Strategist

With over 12 years in system architecture and automation, Darian specializes in simplifying complex cloud infrastructures. An advocate for open-source solutions, he founded TechResolve to provide engineers with actionable, battle-tested troubleshooting guides and robust software alternatives.

🤖 Frequently Asked Questions

❓ How can I monitor MySQL replication lag automatically?

Automate MySQL replication lag monitoring by deploying a Python script that queries `SHOW SLAVE STATUS` for `Seconds_Behind_Master` and sends this value as a custom gauge metric to Datadog via its API, scheduled by a cron job.

❓ How does this Datadog custom metric approach compare to manual replication lag checks?

This automated Datadog solution provides proactive, real-time alerting and historical data for trend analysis, eliminating the tedious, reactive, and error-prone manual SSH checks of `SHOW SLAVE STATUS` that can lead to missed critical lag spikes.

❓ What is a common implementation pitfall when setting up MySQL replication lag monitoring with Datadog?

A common pitfall is incorrect MySQL user privileges. The dedicated monitoring user must have `REPLICATION CLIENT` privilege; otherwise, `SHOW SLAVE STATUS` will return an empty result, preventing the script from accurately reporting lag.

TechResolve – SaaS Troubleshooting & Software Alternatives

Leave a ReplyCancel reply