Solved: Monitor ZFS Pool Health and Scrub Status via Cron

🚀 Executive Summary

TL;DR: Manual ZFS pool health and scrub status checks are inefficient and risky, potentially missing silent disk failures. This guide provides an automated solution using a Python script to parse `zpool status` output for errors and scrub recency, sending email alerts via a scheduled cron job.

🎯 Key Takeaways

Automated ZFS pool health monitoring can be achieved with a Python script that executes `zpool status -v` and parses its output for pool state, data errors, and scrub completion.
The Python script identifies issues by checking if the pool is `ONLINE`, if there are `No known data errors`, and if the last `scrub` was completed within a defined `MAX_SCRUB_DAYS`.
When scheduling with cron, it’s critical to address cron’s minimal `PATH` environment by using full command paths or defining `PATH` in the crontab, and to ensure script permissions and outbound firewall rules for SMTP alerts.

Monitor ZFS Pool Health and Scrub Status via Cron

Hey there, I’m Darian Vance. As a Senior DevOps Engineer here at TechResolve, I’ve spent more hours than I can count staring at `zpool status` outputs. For years, my monitoring strategy was, well, not a strategy. It was me remembering to SSH into our filers and manually check for errors. After a close call with a silent disk failure that a scrub would have caught, I knew I had to automate. This simple setup I’m about to show you has saved me countless hours and gives me genuine peace of mind. It’s a “set it and forget it” solution that only alerts you when you actually need to do something.

Let’s build a reliable, automated ZFS health check that respects your time.

Prerequisites

A server with a ZFS pool already configured.
Python 3 installed on that server.
A method for sending notifications (e.g., an SMTP server for email, or a webhook URL for a service like Slack).
Basic familiarity with cron jobs.

The Guide: Step-by-Step

Step 1: The Python Monitoring Script

First things first, we need a script to do the heavy lifting. This script will execute `zpool status -v`, parse the output, and decide if a notification is necessary. I’ll skip the standard virtualenv setup since you likely have your own workflow for that. The script itself uses only the Python standard library; install `python-dotenv` only if you choose to use it for managing configuration.

Let’s create a file named `zfs_monitor.py`. Here is the logic I use in my production setups.


import os
import subprocess
import smtplib
from email.mime.text import MIMEText
from datetime import datetime, timedelta

# --- Configuration ---
# I strongly recommend using environment variables or a config file
# rather than hardcoding credentials. For this example, we'll use os.getenv.
# Create a 'config.env' file in the same directory.
# Example config.env:
# ZPOOL_NAME=your_pool_name
# ALERT_EMAIL_TO=your-email@example.com
# ALERT_EMAIL_FROM=zfs-monitor@your-server.com
# SMTP_SERVER=smtp.your-provider.com
# SMTP_PORT=587
# SMTP_USER=your-smtp-user
# SMTP_PASSWORD=your-smtp-password

ZPOOL_NAME = os.getenv('ZPOOL_NAME', 'rpool') # Default to 'rpool' if not set
ALERT_EMAIL_TO = os.getenv('ALERT_EMAIL_TO')
MAX_SCRUB_DAYS = 35 # Alert if the last scrub was more than this many days ago

def send_alert(subject, body):
    """Sends an email alert."""
    if not ALERT_EMAIL_TO:
        print("ALERT: Email recipient not configured. Cannot send alert.")
        print(f"Subject: {subject}\nBody: {body}")
        return

    msg = MIMEText(body)
    msg['Subject'] = subject
    msg['From'] = os.getenv('ALERT_EMAIL_FROM')
    msg['To'] = ALERT_EMAIL_TO

    try:
        # Fall back to port 587 (the value in the example config) if SMTP_PORT is unset
        with smtplib.SMTP(os.getenv('SMTP_SERVER'), int(os.getenv('SMTP_PORT', '587'))) as server:
            server.starttls()
            server.login(os.getenv('SMTP_USER'), os.getenv('SMTP_PASSWORD'))
            server.sendmail(msg['From'], [msg['To']], msg.as_string())
            print("Successfully sent alert email.")
    except Exception as e:
        print(f"Failed to send email: {e}")

def check_zfs_pool():
    """Checks the ZFS pool status and looks for issues."""
    try:
        command = ["zpool", "status", "-v", ZPOOL_NAME]
        result = subprocess.run(command, capture_output=True, text=True, check=True)
        output = result.stdout
        alerts = []

        # 1. Check overall pool state
        if "state: ONLINE" not in output:
            alerts.append(f"Pool '{ZPOOL_NAME}' is not in an ONLINE state.")

        # 2. Check for any errors
        if "errors: No known data errors" not in output:
            alerts.append(f"Pool '{ZPOOL_NAME}' reports data errors.")

        # 3. Check last scrub date
        scrub_completed = False
        for line in output.splitlines():
            # Match the completed-scrub line; it reads "scrub repaired ... on <date>"
            if "scrub repaired" in line and " on " in line:
                # Example line: "scan: scrub repaired 0B in 00:01:13 with 0 errors on Sun Nov 19 02:25:17 2023"
                parts = line.split()
                date_str = " ".join(parts[-5:])
                try:
                    last_scrub_date = datetime.strptime(date_str, '%a %b %d %H:%M:%S %Y')
                    if datetime.now() - last_scrub_date > timedelta(days=MAX_SCRUB_DAYS):
                        alerts.append(f"Last scrub for '{ZPOOL_NAME}' was over {MAX_SCRUB_DAYS} days ago.")
                    scrub_completed = True
                except ValueError:
                    alerts.append(f"Could not parse last scrub date from line: {line}")
                break
        
        if not scrub_completed:
            # This handles cases where a scrub has never been completed
            alerts.append(f"Could not find a completed scrub for pool '{ZPOOL_NAME}'.")

        # If we have any alerts, send them
        if alerts:
            subject = f"ZFS Alert for Pool: {ZPOOL_NAME}"
            body = "The following issues were detected:\n\n" + "\n".join(f"- {a}" for a in alerts)
            body += f"\n\n--- Full 'zpool status -v' Output ---\n\n{output}"
            send_alert(subject, body)
        else:
            print(f"Pool '{ZPOOL_NAME}' is healthy. No issues found.")

    except FileNotFoundError:
        print("Error: 'zpool' command not found. Is ZFS installed and in the system's PATH?")
    except subprocess.CalledProcessError as e:
        subject = f"ZFS Critical Alert: Failed to check pool {ZPOOL_NAME}"
        body = f"The command 'zpool status' failed to execute.\n\nStderr:\n{e.stderr}"
        send_alert(subject, body)
    except Exception as e:
        subject = "ZFS Monitor Script Error"
        body = f"An unexpected error occurred in the monitoring script: {e}"
        send_alert(subject, body)

if __name__ == "__main__":
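    # Under cron none of these variables are set automatically. Either define
    # them in the crontab, or load 'config.env' with python-dotenv. Note that
    # ZPOOL_NAME and ALERT_EMAIL_TO are read at import time, so the two lines
    # below only take effect if they run before the Configuration section
    # (move them to the top of the file, just after the imports):
    # from dotenv import load_dotenv
    # load_dotenv('config.env')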
    check_zfs_pool()
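
The scrub check is the most fragile part of the script, because it depends on pulling a date out of a single line of `zpool status` output. Here is a minimal sketch of that step as a pair of small functions you can test without a real pool. It assumes the line format shown in the script’s example comment; the names `parse_last_scrub` and `scrub_is_stale` are hypothetical helpers, not part of the script above, and other ZFS versions may word the line differently.

```python
import re
from datetime import datetime, timedelta
from typing import Optional

# Assumed format (taken from the example comment in the script):
# "scan: scrub repaired 0B in 00:01:13 with 0 errors on Sun Nov 19 02:25:17 2023"
SCRUB_LINE = re.compile(
    r"scrub repaired .* on (\w{3} \w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2} \d{4})"
)


def parse_last_scrub(status_output: str) -> Optional[datetime]:
    """Return the completion time of the last scrub, or None if none is reported."""
    match = SCRUB_LINE.search(status_output)
    if match is None:
        return None
    # Collapse repeated spaces (single-digit days are often padded) before parsing.
    date_str = " ".join(match.group(1).split())
    return datetime.strptime(date_str, "%a %b %d %H:%M:%S %Y")


def scrub_is_stale(last_scrub: datetime, now: datetime, max_days: int = 35) -> bool:
    """True if the last scrub finished more than max_days before now."""
    return now - last_scrub > timedelta(days=max_days)
```

Keeping the parsing separate from the alerting makes it easy to paste a line from your own `zpool status` output into a quick test and confirm it parses before you rely on the alert.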

Pro Tip: In my production setups, I create a dedicated, non-privileged user to run this script. The `zpool` command doesn’t require root privileges for status checks, so this is a great way to limit the script’s potential blast radius. Safety first!

Step 2: Scheduling with Cron

With our script ready, we need to schedule it to run automatically. This is where cron comes in. We’ll set it to run once a week, which is a reasonable frequency for scrub checks.

You’ll need to edit your user’s crontab file. On most systems `crontab -e` opens it, though this can vary. Once you have it open, add a line like this:

0 2 * * 1 python3 zfs_monitor.py

Let’s break that down:

0 2 * * 1: This means “At 02:00 on Monday.” It’s a good practice to run scrubs and checks during off-peak hours.
python3 zfs_monitor.py: This is the command to execute. Important: Cron jobs run with a minimal environment and usually start in the user’s home directory. `python3` must be on cron’s `PATH`, and `zfs_monitor.py` must either sit in that working directory or be referenced by its full path. Using full paths for both is the safest option. I recommend placing the script in a dedicated directory in your user’s home folder.
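
Cron’s minimal environment also means the configuration variables the script reads won’t be set unless something loads them. As a rough sketch, here is a standard-library stand-in for that step. It assumes `config.env` contains plain `KEY=VALUE` lines like the example in the script’s comments; `load_env_file` is a hypothetical helper of my own, not the `python-dotenv` API, and it does not handle quoting or `export` prefixes.

```python
import os
from pathlib import Path


def load_env_file(path) -> dict:
    """Read simple KEY=VALUE lines from a file into os.environ.

    Blank lines, comment lines starting with '#', and lines without '='
    are skipped. Returns the key/value pairs that were loaded.
    """
    loaded = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        loaded[key.strip()] = value.strip()
    os.environ.update(loaded)
    return loaded
```

If you try this, call it near the top of the script, before any `os.getenv` lookups, and build the path from the script’s own location (for example `Path(__file__).resolve().parent / "config.env"`) so it doesn’t depend on the directory cron starts in.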

Pro Tip: Add a “heartbeat” to your script. Modify it to send a “System OK” email once a month, regardless of status. If that email doesn’t arrive, you know the cron job itself has stopped running: a cheap way to monitor your monitor.
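
One way the heartbeat could look, as a sketch rather than a finished feature: since the job runs weekly, the first run of each month always falls on day 1 through 7, so a simple day-of-month test is enough to send one heartbeat per month. The function names `heartbeat_due` and `build_heartbeat` are hypothetical, and the approach assumes the weekly schedule shown above.

```python
from datetime import date


def heartbeat_due(today: date) -> bool:
    """True on the first weekly run of a month (day of month 1 through 7)."""
    return today.day <= 7


def build_heartbeat(pool_name: str, today: date) -> tuple:
    """Return a (subject, body) pair for the monthly 'System OK' message."""
    subject = f"ZFS Monitor heartbeat: {pool_name} OK"
    body = (
        f"Monthly heartbeat sent on {today.isoformat()}.\n"
        f"The monitoring job ran and found no issues with pool '{pool_name}'."
    )
    return subject, body
```

In the script, the natural place for this is the healthy branch of `check_zfs_pool()`: when `heartbeat_due(date.today())` is true, pass the result of `build_heartbeat(...)` to `send_alert`.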

Common Pitfalls

Here’s where I’ve tripped up in the past, so you don’t have to:

Cron’s Minimal PATH: This is the number one issue. Your interactive shell has a rich `PATH` variable, but cron’s is barebones, so the script might fail because it can’t find `zpool`. The easiest fix is to use the full path to the `zpool` executable in the script’s `command` list (commonly `/usr/sbin/zpool` or `/sbin/zpool`, depending on the system). Another robust solution is to define `PATH` at the top of your crontab file.
Permissions: The script file must be readable by the user running the cron job; it only needs to be executable if you invoke it directly rather than through `python3`. A quick permissions check can save a lot of head-scratching.
Firewall Rules: If you’re sending emails via SMTP, make sure your server’s firewall allows outbound connections on the specified port (usually 587 or 465). I once spent an hour debugging my script only to find a firewall rule was silently dropping the packets. Always test the notification function manually from the server.
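
Before digging into the script when an alert never arrives, it helps to confirm that the server can open a TCP connection to the SMTP port at all. The following is a minimal sketch using only the standard library; `can_reach` is a hypothetical helper name, and the host and port you pass in are whatever you configured for `SMTP_SERVER` and `SMTP_PORT`.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures alike.
        return False
```

A `False` result points at the network path (firewall, DNS, wrong port) rather than at the script. A `True` result only proves the connection opens; authentication and TLS can still fail, so a manual call to `send_alert` remains the real end-to-end test.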

Conclusion

And that’s it. With one Python script and a single line in your crontab, you’ve moved from reactive, manual checks to proactive, automated monitoring. This frees you up to focus on more complex problems, secure in the knowledge that if your storage pool needs attention, you’ll be the first to know. Now, go grab a coffee—you’ve earned it.

All the best,
Darian Vance
Senior DevOps Engineer, TechResolve

Darian Vance

Lead Cloud Architect & DevOps Strategist

With over 12 years in system architecture and automation, Darian specializes in simplifying complex cloud infrastructures. An advocate for open-source solutions, he founded TechResolve to provide engineers with actionable, battle-tested troubleshooting guides and robust software alternatives.

🤖 Frequently Asked Questions

❓ How can I automatically monitor my ZFS pool’s health and scrub status?

You can automate ZFS pool monitoring by creating a Python script that runs `zpool status -v`, parses the output for the pool’s `ONLINE` state, `data errors`, and the `last scrub date`. If issues are found or the scrub is too old, the script sends an email alert, and it’s scheduled to run periodically using a cron job.

❓ How does this cron-based ZFS monitoring compare to more comprehensive monitoring solutions?

This cron-based solution provides a lightweight, focused, and customizable method for ZFS health checks, suitable for direct server monitoring. It’s simpler than full enterprise monitoring systems like Prometheus or Zabbix, which offer broader system metrics, dashboards, and various alert integrations, making it ideal for specific ZFS-centric alerts without extensive infrastructure.

❓ What is a common implementation pitfall for this ZFS monitoring setup, and how is it resolved?

A common pitfall is cron’s minimal `PATH` environment, causing the `zpool` command to be “not found.” This is resolved by either specifying the full path to the `zpool` executable within the Python script (e.g., `/usr/sbin/zpool` or `/sbin/zpool` depending on system) or by explicitly defining the `PATH` variable at the top of the crontab file.
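
As a sketch of the first option, the script could resolve the executable once at startup instead of trusting cron’s `PATH`. The fallback locations below are the two mentioned in this answer; `find_zpool` is a hypothetical helper, and your system may install `zpool` somewhere else.

```python
import os
import shutil

# Fallback locations mentioned above; adjust for your system.
CANDIDATE_PATHS = ("/usr/sbin/zpool", "/sbin/zpool")


def find_zpool(candidates=CANDIDATE_PATHS):
    """Return a usable path to zpool, or None if it cannot be found.

    Tries the current PATH first, then each candidate location in order.
    """
    found = shutil.which("zpool")
    if found:
        return found
    for path in candidates:
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    return None
```

The returned path can then replace the bare `"zpool"` string in the script’s `command` list, and a `None` result is a clear signal to send an alert rather than fail silently.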

TechResolve – SaaS Troubleshooting & Software Alternatives
