🚀 Executive Summary
TL;DR: Automating RAID controller health monitoring on Linux servers is crucial to prevent silent catastrophic failures. This guide provides a Python script leveraging `storcli` and cron to proactively detect degraded arrays and send immediate alerts via webhooks, ensuring system stability and reducing manual oversight.
🎯 Key Takeaways
- The `storcli` command (`sudo storcli /c0 show all`) is fundamental for manually verifying RAID controller, virtual drive, and physical drive health, looking for ‘Optimal’ or ‘Online’ states.
- A Python script automates this check by executing `storcli`, parsing its output for ‘BAD_KEYWORDS’ like ‘Failed’, ‘Critical’, ‘Offline’, ‘Degraded’, or ‘Error’, and sending alerts via `requests` to a configured webhook URL.
- Scheduling the Python script with cron (e.g., `0 2 * * 1 python3 /path/to/script/check_raid.py`) ensures regular, automated monitoring, but requires careful attention to permissions (`sudo`), `PATH` environment variables, and firewall rules for outgoing connections.
Monitor RAID Controller Health Status on Linux Servers
Hey there, Darian Vance here. As a Senior DevOps Engineer at TechResolve, I’ve learned that some of the most catastrophic failures don’t announce themselves. They fester silently. I used to spend a good chunk of my Monday mornings manually SSH-ing into critical servers to check the RAID status. It was tedious, and honestly, a waste of time. After a close call with a degraded array that almost went unnoticed over a weekend, I knew I had to automate this. This guide is the result—a simple, effective way to get your servers to tell you when something is wrong. Let’s get this set up so you can focus on more important things.
Prerequisites
Before we dive in, make sure you have the following ready to go:
- A Linux server with a Broadcom/LSI/Avago RAID controller. These are incredibly common in Dell, HP, and Supermicro servers.
- The `storcli` command-line utility installed. This is the modern tool for managing these cards. If you have an older system, you might be using `megacli`, but the principles are the same.
- Python 3 installed on the server.
- An incoming webhook URL from a service like Slack, Microsoft Teams, or PagerDuty. This is how our script will send alerts.
The Guide: Step-by-Step Automation
Step 1: Verify the Manual Command
First, let’s make sure we can get the health status directly from the command line. The whole automation is built on this foundation. SSH into your server and run the command to get a summary of your controller, virtual drives (your RAID arrays), and physical drives.
The command I use is:
sudo storcli /c0 show all
You’re looking for clean, positive output. Keywords like “Status = Optimal”, “State = Optimal”, and “State = Online” are what we want to see. If you see anything like “Degraded,” “Failed,” or “Offline,” you have a problem that needs immediate attention. Once you’ve confirmed the command works and you understand its output, we can move on to automation.
Pro Tip: The `/c0` specifies controller 0. Most servers only have one, but if you have multiple controllers, you’ll need to run this command for each one (`/c1`, `/c2`, etc.). Our script can easily be adapted to loop through multiple controllers.
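For servers with more than one controller, one way to adapt the approach (a sketch; `controller_count` is an assumed configuration value you would set per server) is to generate one `storcli` invocation per controller and run them in turn:

```python
def build_raid_commands(controller_count):
    """Build one storcli invocation per controller: /c0, /c1, ..."""
    return [
        ["sudo", "storcli", f"/c{i}", "show", "all"]
        for i in range(controller_count)
    ]

# With two controllers you would run each of these commands in turn:
for cmd in build_raid_commands(2):
    print(" ".join(cmd))
```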
Step 2: Prepare Your Python Script Environment
Now, let’s get our Python script set up. I’ll skip the standard project directory and virtual environment setup steps since you likely have your own workflow for that. The key is to have a clean space for our script and its dependencies.
You’ll need a couple of Python libraries. You can install them using pip. I recommend `python-dotenv` for managing secrets like our webhook URL, and `requests` for sending the web notifications.
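The install is a single pip command (use the pip inside your virtual environment if you created one):

```shell
# Install the two third-party libraries the monitoring script depends on.
pip3 install requests python-dotenv
```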
Next, create two files in your project directory: `check_raid.py` for our script and `config.env` to store our webhook URL. Storing the webhook in a separate config file is a security best practice.
Your `config.env` file should contain one line:
SLACK_WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
Step 3: The Python Monitoring Script
Alright, let’s write the code. The logic is simple: run the `storcli` command, capture the output, check for any “bad” keywords, and fire off an alert if we find any.
Here is the full script for `check_raid.py`:
````python
import os
import socket
import subprocess

import requests
from dotenv import load_dotenv

# --- Configuration ---
load_dotenv('config.env')
SLACK_WEBHOOK_URL = os.getenv("SLACK_WEBHOOK_URL")
HOSTNAME = socket.gethostname()

# The command to check RAID status. Note the path might need to be absolute
# depending on the cron environment. For this example, we assume it's in the PATH.
RAID_COMMAND = ["sudo", "storcli", "/c0", "show", "all"]

# Keywords that indicate a problem. Checking explicitly for bad states keeps
# the logic simple; verifying the presence of good states is also a valid approach.
BAD_KEYWORDS = [
    "Failed",
    "Critical",
    "Offline",
    "Degraded",
    "Error"
]


def check_raid_status():
    """Runs the storcli command and returns (stdout, error_message)."""
    try:
        # check=True automatically raises an exception if the command fails.
        result = subprocess.run(
            RAID_COMMAND,
            capture_output=True,
            text=True,
            check=True
        )
        return result.stdout, None
    except FileNotFoundError:
        return None, (
            f"Error: The command '{RAID_COMMAND[1]}' was not found. "
            "Is it installed and in the system's PATH?"
        )
    except subprocess.CalledProcessError as e:
        return None, f"Command failed with exit code {e.returncode}.\nStderr: {e.stderr}"


def parse_output(output):
    """Parses the command output for any signs of trouble."""
    if not output:
        return ["No output received from the command."]
    found_issues = []
    for keyword in BAD_KEYWORDS:
        # Case-insensitive match; report the whole line for better context.
        for line in output.splitlines():
            if keyword.lower() in line.lower():
                found_issues.append(f"Detected '{keyword}': {line.strip()}")
    # Return a list of unique issues.
    return list(set(found_issues))


def send_slack_alert(issues):
    """Sends a formatted alert to a Slack webhook."""
    if not SLACK_WEBHOOK_URL:
        print("Error: SLACK_WEBHOOK_URL is not set in config.env")
        return
    message = f"🚨 *RAID Health Alert on {HOSTNAME}* 🚨\n\n"
    message += "The following potential issues were detected:\n"
    message += "```\n"
    for issue in issues:
        message += f"- {issue}\n"
    message += "```\n"
    message += "Please investigate immediately."
    payload = {"text": message}
    try:
        # A timeout keeps the script from hanging if the webhook is unreachable.
        response = requests.post(SLACK_WEBHOOK_URL, json=payload, timeout=10)
        response.raise_for_status()  # Raises an HTTPError for 4xx/5xx responses.
        print("Alert sent successfully.")
    except requests.exceptions.RequestException as e:
        print(f"Failed to send Slack alert: {e}")


def main():
    """Main function to orchestrate the check."""
    print("Starting RAID health check...")
    output, error = check_raid_status()
    if error:
        send_slack_alert([error])
        return  # Exit the function cleanly.
    issues = parse_output(output)
    if issues:
        print(f"Found {len(issues)} issues. Sending alert...")
        send_slack_alert(issues)
    else:
        print("RAID status appears healthy. No issues detected.")


if __name__ == "__main__":
    main()
````
Pro Tip: In my production setups, I make the `BAD_KEYWORDS` list more comprehensive. I also check for controller battery health (“BBU Status”) and predictive failure counts on drives (“S.M.A.R.T. alert”). The more specific you are, the more reliable your alerting will be.
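An extended keyword list along those lines might look like the sketch below. The exact phrases (`BBU Status = Failed`, `S.M.A.R.T alert flagged`) are assumptions about your controller's output format; verify them against a real `storcli` dump before relying on them. The helper mirrors the article's `parse_output` logic in a testable form:

```python
# A more comprehensive keyword list (a sketch; tune phrases to your
# controller's actual output before deploying).
EXTENDED_BAD_KEYWORDS = [
    "Failed",
    "Critical",
    "Offline",
    "Degraded",
    "Error",
    "Rebuild",                  # a rebuild in progress is worth knowing about
    "BBU Status = Failed",      # assumed phrasing for a dead controller battery
    "S.M.A.R.T alert flagged",  # assumed phrasing for a predictive drive failure
]

def find_issues(output, keywords):
    """Return the lines of `output` containing any keyword (case-insensitive)."""
    issues = []
    for line in output.splitlines():
        lower = line.lower()
        if any(k.lower() in lower for k in keywords):
            issues.append(line.strip())
    return issues
```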
Step 4: Schedule the Script with Cron
An automation script is only useful if it runs automatically. We’ll use cron, the standard Linux job scheduler, to run our check on a regular basis. You’ll want to edit the crontab file for a user with sufficient permissions to run the script (often the root user).
Add a line like this to schedule the script. This example runs it at 2:00 AM every Monday:
0 2 * * 1 python3 /path/to/your/project/check_raid.py
Make sure you use the correct, full path to your Python script. Once saved, the job is scheduled, and you can rest easy knowing your RAID arrays are being watched.
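If you also want a log of each run for debugging, you can redirect the script’s output in the crontab entry. This variant is a sketch: the `/usr/bin/python3` interpreter path and the log location are assumptions, so adjust them to your system.

```
0 2 * * 1 /usr/bin/python3 /path/to/your/project/check_raid.py >> /var/log/check_raid.log 2>&1
```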
Common Pitfalls (Where I Usually Mess Up)
- Permissions Fiasco: The `storcli` command almost always requires `sudo`. If you run the cron job as a non-root user, it will fail. You either need to run the cron job as root or, more securely, add a `NOPASSWD` entry in the sudoers file for that specific command and user.
- The PATH Problem: The environment for a cron job is minimal and often doesn’t have the same `PATH` as your interactive shell. The script might fail because it can’t find `storcli`. The easiest fix is to use the full, absolute path to the utility inside the Python script’s `RAID_COMMAND` list.
- Firewall Blockage: On hardened servers, outgoing connections might be blocked by default. If your script runs but you never get an alert, check your firewall (like `ufw` or `iptables`) to ensure the server can make HTTPS connections to the Slack/Teams webhook URL.
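For the permissions pitfall, a scoped sudoers entry avoids running the whole cron job as root. The example below is a sketch: the `monitor` user and the binary path are assumptions (vendor packages often install the tool as `/opt/MegaRAID/storcli/storcli64`, but check your system), and the entry should be created with `visudo` so a syntax error can’t lock you out.

```
# /etc/sudoers.d/raid-monitor  (edit with: visudo -f /etc/sudoers.d/raid-monitor)
monitor ALL=(root) NOPASSWD: /opt/MegaRAID/storcli/storcli64 /c0 show all
```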
Conclusion
And that’s it. You’ve now got a robust, automated monitoring system for one of your server’s most critical components. This simple script has saved my team countless hours and prevented at least one major outage. It’s a prime example of how a little bit of DevOps automation can provide huge value and peace of mind. Now, go enjoy that extra time you just saved.
– Darian Vance
🤖 Frequently Asked Questions
❓ What are the essential prerequisites for setting up automated RAID health monitoring on Linux?
You need a Linux server with a Broadcom/LSI/Avago RAID controller, the `storcli` command-line utility installed, Python 3, and an incoming webhook URL from an alerting service like Slack or PagerDuty.
❓ How does this automated RAID monitoring solution improve upon manual checks?
This automated script eliminates the need for tedious manual SSH checks, providing proactive, real-time alerts when RAID arrays or drives enter a ‘Degraded’, ‘Failed’, or ‘Offline’ state, significantly reducing the risk of unnoticed catastrophic failures and saving valuable time.
❓ What are the most common challenges encountered when implementing this RAID monitoring script with cron?
Key challenges include ensuring the `storcli` command has `sudo` permissions, addressing the minimal `PATH` environment in cron by using absolute paths for utilities, and configuring firewalls to allow outgoing HTTPS connections to the webhook URL.