🚀 Executive Summary
TL;DR: Manually checking Ansible logs for failures is tedious and error-prone. This guide provides a solution to automate monitoring Ansible playbook failures by configuring Ansible to log output, creating a Slack Incoming Webhook, and using a Python script scheduled with cron to send real-time notifications to Slack.
🎯 Key Takeaways
- Configure Ansible’s `ansible.cfg` with `log_path = ./ansible.log` to centralize all playbook execution output into a predictable file.
- Develop a Python script (`ansible_monitor.py`) that utilizes `requests` and `python-dotenv` to parse the Ansible log for new ‘FATAL’ or ‘FAILED’ entries since the last run, using a `last_run.txt` state file.
- Securely manage the Slack Incoming Webhook URL and Ansible log file path using a `config.env` file, loaded via `python-dotenv`, and ensure `config.env` and log files are added to `.gitignore`.
Monitor Ansible Playbook Failures and Report to Slack
Hey there, Darian Vance here. As a Senior DevOps Engineer at TechResolve, I’ve seen my fair share of silent failures. I used to spend the first 30 minutes of my day manually checking Ansible logs across a fleet of servers. It was tedious, error-prone, and a terrible way to start the morning. The worst part? Sometimes I’d only find a critical failure from the night before when a developer pinged me about a broken environment.
I realized I was wasting hours every week on a task a simple script could automate. So, I built this workflow to push failure notifications directly to a Slack channel. It turned my reactive log-checking into proactive, real-time alerting. Let me walk you through how to set it up; it’s a real time-saver.
Prerequisites
Before we dive in, make sure you have a few things ready. This guide assumes you’re comfortable with the basics.
- Ansible: Already installed and configured on the machine where your playbooks run.
- Python 3: You’ll need Python 3 installed to run our monitoring script.
- Slack Workspace: You need permissions to add an app and create an Incoming Webhook.
- A Scheduler: Something like cron on Linux/macOS or Task Scheduler on Windows to run the script automatically.
The Step-by-Step Guide
Step 1: Configure Ansible to Log Output
First things first, we need Ansible to write its output to a predictable file. If it isn’t doing so already, you can easily configure this in your ansible.cfg file. In my projects, I usually place this file in the root of my Ansible repository.
Add these lines to the [defaults] section of your ansible.cfg:
[defaults]
log_path = ./ansible.log
This tells Ansible to append all playbook execution output to a file named ansible.log in the same directory. Now, whenever you run a playbook, you’ll have a clean, timestamped record of what happened.
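To confirm logging is working, run any playbook and peek at the file. Each line begins with a millisecond timestamp (the fields after it vary by Ansible version, so the sample line below is illustrative, not from a real run). As a quick sketch, here's how that leading timestamp can be parsed in Python — the same trick our monitoring script will use later:

```python
from datetime import datetime

# Illustrative ansible.log line; the fields after the timestamp vary by version.
sample = "2023-10-27 10:30:00,123 p=1234 u=deploy | TASK [Install packages]"

# Split off the milliseconds, then parse the date and time portion.
stamp = datetime.strptime(sample.split(",")[0], "%Y-%m-%d %H:%M:%S")
print(stamp.isoformat())  # 2023-10-27T10:30:00
```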
Step 2: Create a Slack Incoming Webhook
This is our communication channel. The webhook gives us a unique URL that we can send messages to, and they’ll appear in the Slack channel of our choice.
- Go to api.slack.com/apps and click “Create New App”.
- Choose “From scratch”, give it a name like “Ansible Failure Monitor”, and select your workspace.
- From the app’s dashboard, select “Incoming Webhooks” from the feature list on the left.
- Toggle the “Activate Incoming Webhooks” switch to On.
- Click “Add New Webhook to Workspace”. Choose the channel you want notifications to go to (I recommend a dedicated channel like #ansible-alerts) and click “Allow”.
- Slack will generate a Webhook URL for you. It will look something like https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX.
Crucially: Treat this URL like a password. Do not commit it to your Git repository. We’ll store it securely in an environment file.
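Before wiring up the full monitor, it's worth sending a one-off test message to make sure the webhook works. Here's a minimal sketch; paste in the URL Slack generated for you (Slack replies with HTTP 200 and the body "ok" on success):

```python
import requests

def ping_webhook(url: str) -> int:
    """Post a simple test message to a Slack Incoming Webhook.

    Returns the HTTP status code; Slack replies 200 when the message is accepted.
    """
    response = requests.post(url, json={"text": "Hello from the Ansible monitor!"}, timeout=10)
    return response.status_code

# Usage (substitute your real webhook URL; never commit it to Git):
# ping_webhook("https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX")
```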
Step 3: Write the Python Monitoring Script
Alright, let’s get to the core logic. We’ll write a Python script that reads the log file, finds new failures, and sends a formatted message to our Slack webhook.
For this setup, I’ll skip the standard virtualenv creation since you likely have your own workflow for that. Just create a project directory and install the two Python packages we need: requests (for sending the web request to Slack) and python-dotenv (for managing our secret webhook URL). Both install with pip: pip install requests python-dotenv.
Let’s organize our project with three files:
- config.env: To store our secret webhook URL and configuration.
- ansible_monitor.py: Our main script.
- last_run.txt: The script will create and manage this file to remember the last time it checked the logs.
A. Create the Configuration File
In your project directory, create a file named config.env and add your webhook URL and the path to your log file.
# config.env
SLACK_WEBHOOK_URL="your_slack_webhook_url_here"
ANSIBLE_LOG_FILE="./ansible.log"
Pro Tip: Always add your configuration and state files, like config.env and *.log, to your .gitignore file. You never want to commit secrets or large, machine-generated log files to version control.
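A minimal .gitignore for this project might look like the following (adjust to your repository layout):

```
# .gitignore
config.env
*.log
last_run.txt
```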
B. The Python Script: ansible_monitor.py
Here’s the full script. I’ve added comments to explain what each part does. The main idea is to only process log entries that have appeared since the script was last run.
# ansible_monitor.py
import os
from datetime import datetime, timezone, timedelta

import requests
from dotenv import load_dotenv

# State file to store the timestamp of the last check
STATE_FILE = "last_run.txt"


def get_last_run_time():
    """Reads the timestamp from the state file.

    If the file doesn't exist, it defaults to 1 hour ago.
    """
    try:
        with open(STATE_FILE, 'r') as f:
            # The timestamp is stored in ISO 8601 format
            return datetime.fromisoformat(f.read().strip())
    except FileNotFoundError:
        # If we've never run, only check the last hour to avoid a flood of old alerts
        return datetime.now(timezone.utc) - timedelta(hours=1)


def save_current_run_time(run_time):
    """Saves the current timestamp to the state file."""
    with open(STATE_FILE, 'w') as f:
        f.write(run_time.isoformat())


def parse_ansible_log(log_path, last_run_time):
    """Parses the Ansible log for new failures."""
    failures = []
    try:
        with open(log_path, 'r') as log_file:
            for line in log_file:
                try:
                    # Ansible log format: "2023-10-27 10:30:00,123 - ..."
                    log_time_str = line.split(',')[0]
                    log_time = datetime.strptime(log_time_str, "%Y-%m-%d %H:%M:%S")
                    # Assuming the server logs in UTC
                    log_time_utc = log_time.replace(tzinfo=timezone.utc)
                    if log_time_utc > last_run_time:
                        if "FATAL" in line or "FAILED" in line:
                            # We found a failure that's new since our last check
                            failures.append(line.strip())
                except (ValueError, IndexError):
                    # Ignore lines that don't match the expected timestamp format
                    continue
    except FileNotFoundError:
        print(f"Log file not found at: {log_path}")
        return []  # Return an empty list if log file does not exist
    return failures


def send_slack_notification(webhook_url, failures):
    """Sends a formatted failure report to Slack."""
    if not failures:
        return  # Don't send anything if there are no new failures

    # Join all failure lines into a single string for the message body
    message_body = "\n".join(failures)
    payload = {
        "text": "🚨 Ansible Playbook Failure Detected!",
        "blocks": [
            {
                "type": "header",
                "text": {
                    "type": "plain_text",
                    "text": "🚨 Ansible Playbook Failure Detected!"
                }
            },
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": f"Found {len(failures)} new failure(s) in the logs. Please investigate."
                }
            },
            {"type": "divider"},
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": f"*Failed Tasks:*\n```{message_body}```"
                }
            }
        ]
    }
    try:
        response = requests.post(webhook_url, json=payload, timeout=10)
        response.raise_for_status()  # Raises an exception for bad status codes (4xx or 5xx)
        print("Successfully sent notification to Slack.")
    except requests.exceptions.RequestException as e:
        print(f"Error sending Slack notification: {e}")


def main():
    """Main function to orchestrate the monitoring process."""
    # Load environment variables from config.env
    load_dotenv(dotenv_path='config.env')
    webhook_url = os.getenv("SLACK_WEBHOOK_URL")
    log_path = os.getenv("ANSIBLE_LOG_FILE")

    if not webhook_url or not log_path:
        print("Error: SLACK_WEBHOOK_URL or ANSIBLE_LOG_FILE not set in config.env")
        return  # End execution if config is missing

    # Record the start time of this run
    current_run_time = datetime.now(timezone.utc)
    # Get the time of the last run
    last_run_time = get_last_run_time()
    print(f"Checking for failures since {last_run_time.isoformat()}...")

    # Find new failures in the log
    new_failures = parse_ansible_log(log_path, last_run_time)
    # Send a notification if we found any
    send_slack_notification(webhook_url, new_failures)

    # Save the current run time for the next execution
    save_current_run_time(current_run_time)
    print("Check complete.")


if __name__ == "__main__":
    main()
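Before scheduling anything, you can sanity-check the filtering logic against fabricated log lines. This standalone sketch mirrors the core of parse_ansible_log; the sample lines are made up for illustration and are not real Ansible output:

```python
from datetime import datetime, timezone

# Two fabricated log lines: one routine entry and one failure.
lines = [
    "2023-10-27 10:30:00,123 p=1 u=deploy | TASK [Install nginx]",
    "2023-10-27 10:31:00,456 p=1 u=deploy | fatal: [web1]: FAILED! => {...}",
]

# Pretend the last check happened long ago, so both lines count as "new".
last_run = datetime(2023, 1, 1, tzinfo=timezone.utc)

failures = []
for line in lines:
    log_time = datetime.strptime(line.split(",")[0], "%Y-%m-%d %H:%M:%S")
    log_time_utc = log_time.replace(tzinfo=timezone.utc)
    if log_time_utc > last_run and ("FATAL" in line or "FAILED" in line):
        failures.append(line.strip())

print(len(failures))  # only the FAILED line matches
```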
Step 4: Schedule the Script
The final step is to automate it. You don’t want to run this manually. In a Linux or macOS environment, my go-to is cron.
You can set up a cron job to run this script every, say, 15 minutes. Edit your cron table (crontab -e) and add a line like the one below, substituting the correct full path to your project and Python interpreter. Note that the script reads config.env and last_run.txt relative to its working directory, so change into the project directory first:
*/15 * * * * cd /path/to/your/project && python3 ansible_monitor.py
On Windows, you can use the Task Scheduler to achieve the same result, pointing it to your Python executable and providing the script path as an argument.
Common Pitfalls (Where I Usually Mess Up)
I’ve set this up a dozen times, and here are the traps I’ve fallen into:
- File Permissions: The user running the cron job needs read access to ansible.log and read/write access to last_run.txt and the directory it’s in. This is the most common issue.
- Timezone Mismatches: My script assumes the Ansible logs are in UTC. If your server is logging in a different timezone, you’ll need to adjust the datetime parsing logic in the script to handle it correctly. Otherwise, you might miss failures or get duplicate alerts.
- Exposing the Webhook: I can’t stress this enough. Don’t commit the config.env file or paste the webhook URL directly into the script. Use environment variables.
- Log Rotation: If you have a system like logrotate managing your ansible.log, make sure your script can handle the file being moved or renamed. The current script is simple and doesn’t account for this, but for production you might need more robust file-finding logic.
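On the timezone point: if your server logs in local time rather than UTC, one way to fix the comparison is to attach the server’s zone before converting, using the standard-library zoneinfo module (Python 3.9+). The zone name here is a hypothetical example; use whatever your server is actually set to:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library in Python 3.9+

# Hypothetical: the server writes log timestamps in US Eastern time.
SERVER_TZ = ZoneInfo("America/New_York")

log_time = datetime.strptime("2023-10-27 10:30:00", "%Y-%m-%d %H:%M:%S")

# Attach the server's zone, then convert to UTC before comparing to last_run_time.
log_time_utc = log_time.replace(tzinfo=SERVER_TZ).astimezone(timezone.utc)
print(log_time_utc.isoformat())  # 2023-10-27T14:30:00+00:00
```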
Conclusion
And that’s it. With a simple configuration change, a Slack webhook, and one Python script, you’ve built a robust monitoring system for your Ansible runs. This “set it and forget it” utility frees you from manual log checking and gives you immediate visibility into problems.
In my experience, automating small, tedious tasks like this is what separates a good DevOps practice from a great one. It saves time, reduces human error, and lets you focus on the more complex problems. Happy automating!
– Darian Vance
🤖 Frequently Asked Questions
❓ How can I monitor Ansible playbook failures and receive alerts?
You can monitor Ansible playbook failures by configuring Ansible to log output to a file, creating a Slack Incoming Webhook, and using a scheduled Python script to parse the log file for ‘FATAL’ or ‘FAILED’ entries, sending real-time notifications to a designated Slack channel.
❓ How does this automated monitoring compare to manual log checking?
Automated monitoring transforms reactive manual log checking into proactive, real-time alerting. It significantly saves time, reduces human error by eliminating the need for manual review, and provides immediate visibility into critical failures, allowing engineers to focus on more complex tasks.
❓ What is a common implementation pitfall when setting up this Ansible failure monitoring system?
A common pitfall is incorrect file permissions. The user account running the scheduled monitoring script (e.g., via cron) must have read access to the `ansible.log` file and read/write access to the `last_run.txt` state file and its directory. Ensure these permissions are correctly configured to prevent script execution failures.