🚀 Executive Summary

TL;DR: This guide provides a straightforward Python script to automate Kubernetes node resource exhaustion alerts. It addresses the problem of manually checking node health by proactively sending Slack notifications when CPU or memory usage exceeds predefined thresholds, shifting from reactive to proactive cluster management.

🎯 Key Takeaways

  • The solution utilizes the official Kubernetes Python client to connect to the cluster API and queries the `metrics.k8s.io` API for real-time CPU and memory usage.
  • Resource usage is calculated as a percentage of the node’s `allocatable` capacity, providing a more accurate measure of node pressure than raw usage values.
  • Secure configuration management is achieved using `python-dotenv` to load sensitive information like Slack webhook URLs from a `config.env` file, preventing hardcoding of secrets.
  • The script is designed for automated scheduling, with a strong recommendation for deployment as a Kubernetes `CronJob` using a Docker container and appropriate RBAC permissions for a cloud-native, resilient approach.
  • Common pitfalls include incorrect RBAC permissions for accessing node metrics, setting thresholds too low leading to alert fatigue, and not distinguishing between `config.load_kube_config()` for external access and `config.load_incluster_config()` for in-cluster deployments.

Track Kubernetes Node Resource Exhaustion (CPU/Memory) Alerts

Hey there, Darian Vance here. Let’s talk about something that used to be a real time-sink for me: manually checking node health. I’d run `kubectl top nodes`, scan the list, and try to catch hotspots before they became infernos. It felt productive, but in reality I was wasting at least a couple of hours a week on something that could easily be automated. Then an application went down because a node ran out of memory, and I knew I had to build a better system. This guide is that system: a straightforward way to get proactive alerts on Slack when your Kubernetes nodes are under pressure.

Prerequisites

Before we dive in, make sure you have the following ready to go. This will make the process much smoother.

  • Access to a Kubernetes cluster with your `kubeconfig` file correctly set up.
  • Permissions to list nodes and get node metrics. You’ll need access to the `metrics.k8s.io` API group (a quick way to verify both follows this list).
  • A Python 3 environment.
  • A Slack workspace where you can create an incoming webhook.
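
To sanity-check the permission and metrics requirements up front, you can ask the cluster directly (this assumes `kubectl` is pointed at the right context):

kubectl auth can-i list nodes
kubectl auth can-i get nodes.metrics.k8s.io
kubectl top nodes

If `kubectl top nodes` returns data, metrics-server is running and the `metrics.k8s.io` API is reachable.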

The Guide: Step-by-Step

Step 1: Setting Up Your Environment

First things first, let’s get our Python environment ready. I’ll skip the standard virtualenv setup since you likely have your own workflow for that. Let’s jump straight to the dependencies. You’ll need to install a few key libraries using pip. The main ones are `kubernetes` (the official Python client), `requests` (for sending our Slack message), and `python-dotenv` (to handle our configuration securely).
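
That’s a single command:

pip install kubernetes requests python-dotenv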

Step 2: The Python Script Logic

The goal here is simple: query the Kubernetes API for node metrics, check them against our defined thresholds, and fire an alert if anything looks problematic. Here’s the breakdown of what our script will do:

  1. Load Configuration: It will securely load the Kubernetes config from your environment and our Slack webhook URL from a `config.env` file. We never hardcode secrets.
  2. Connect to the API: Using the Python client, it establishes a connection to your cluster’s API server.
  3. Fetch Node Metrics: It queries the `metrics.k8s.io` API to get real-time CPU and memory usage for every node.
  4. Check Thresholds: It iterates through each node, comparing its current resource usage against the CPU and memory thresholds we set. I’ve set both to 85% in the example, which is a reasonable starting point.
  5. Send Alert: If a node exceeds a threshold, the script formats a clear, concise message and sends it to a designated Slack channel using the webhook.

Pro Tip: Don’t just check usage; check capacity too. The script calculates the percentage of resource usage against the node’s total allocatable capacity. This gives you a much more accurate picture of node pressure than just looking at the raw usage values.
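
To make the conversion math concrete, here’s a tiny worked example using the same arithmetic the full script performs below (the values are illustrative, not from a real node):

# metrics.k8s.io reports CPU usage in nanocores ('n' suffix);
# node allocatable CPU comes back in whole cores or millicores ('m').
cpu_usage = float("250000000n".rstrip('n'))   # 250,000,000 nanocores = 0.25 cores
cpu_capacity = float("4") * 1_000_000_000     # 4 allocatable cores, in nanocores
print((cpu_usage / cpu_capacity) * 100)       # -> 6.25 (% of allocatable CPU)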

Step 3: The Script

Here is the complete Python script. I’ve added comments to walk you through the important sections. Save this as something like `node_monitor.py`.


import os
import requests
from kubernetes import client, config
from dotenv import load_dotenv

def send_slack_notification(message):
    """Sends a message to a Slack channel using a webhook."""
    webhook_url = os.getenv("SLACK_WEBHOOK_URL")
    if not webhook_url:
        print("Error: SLACK_WEBHOOK_URL environment variable not set.")
        return

    payload = {'text': message}
    try:
        response = requests.post(webhook_url, json=payload, timeout=10)
        response.raise_for_status()
        print("Successfully sent notification to Slack.")
    except requests.exceptions.RequestException as e:
        print(f"Error sending Slack notification: {e}")

def check_node_resources():
    """
    Checks the CPU and Memory usage of each node in the cluster
    and sends a Slack alert if usage exceeds defined thresholds.
    """
    # Define resource usage thresholds
    CPU_THRESHOLD_PERCENT = 85.0
    MEMORY_THRESHOLD_PERCENT = 85.0

    try:
        # Use the in-cluster config when running inside a pod (e.g., as a
        # Kubernetes CronJob); otherwise fall back to ~/.kube/config.
        try:
            config.load_incluster_config()
        except config.ConfigException:
            config.load_kube_config()
        
        # Create API clients
        core_v1 = client.CoreV1Api()
        custom_objects_api = client.CustomObjectsApi()
        
        print("Fetching node metrics...")
        # Get node metrics
        node_metrics = custom_objects_api.list_cluster_custom_object(
            "metrics.k8s.io", "v1beta1", "nodes"
        )

        nodes = core_v1.list_node().items

        for stats in node_metrics['items']:
            node_name = stats['metadata']['name']
            
            # CPU Usage (in nanocores)
            cpu_usage_str = stats['usage']['cpu']
            # Memory Usage (in Kibibytes)
            mem_usage_str = stats['usage']['memory']

            # Find the corresponding node spec for capacity
            node_spec = next((n for n in nodes if n.metadata.name == node_name), None)
            if not node_spec:
                continue

            # CPU Capacity (in cores, needs conversion to nanocores)
            cpu_capacity_str = node_spec.status.allocatable['cpu']
            # Memory Capacity (in Kibibytes)
            mem_capacity_str = node_spec.status.allocatable['memory']

            # --- Data Cleaning and Conversion ---
            # Convert CPU from 'n' (nanocores) to a float
            cpu_usage_val = float(cpu_usage_str.rstrip('n'))
            
            # Convert Memory from 'Ki' to a float
            mem_usage_val = float(mem_usage_str.rstrip('Ki'))

            # Convert CPU capacity to nanocores
            if 'm' in cpu_capacity_str: # millicores
                cpu_capacity_val = float(cpu_capacity_str.rstrip('m')) * 1_000_000
            else: # full cores
                cpu_capacity_val = float(cpu_capacity_str) * 1_000_000_000
            
            # Convert Memory from 'Ki' to a float
            mem_capacity_val = float(mem_capacity_str.rstrip('Ki'))
            
            # --- Percentage Calculation ---
            cpu_percent = (cpu_usage_val / cpu_capacity_val) * 100
            mem_percent = (mem_usage_val / mem_capacity_val) * 100

            print(f"Node: {node_name} | CPU: {cpu_percent:.2f}% | Memory: {mem_percent:.2f}%")

            # --- Alerting Logic ---
            if cpu_percent > CPU_THRESHOLD_PERCENT:
                message = (f":alert: *High CPU Alert!* `\n`"
                           f"*Node:* `{node_name}` `\n`"
                           f"*Usage:* `{cpu_percent:.2f}%` (Threshold: `{CPU_THRESHOLD_PERCENT}%`)")
                send_slack_notification(message)

            if mem_percent > MEMORY_THRESHOLD_PERCENT:
                message = (f":alert: *High Memory Alert!* `\n`"
                           f"*Node:* `{node_name}` `\n`"
                           f"*Usage:* `{mem_percent:.2f}%` (Threshold: `{MEMORY_THRESHOLD_PERCENT}%`)")
                send_slack_notification(message)

    except Exception as e:
        print(f"An error occurred: {e}")
        # Optionally send an error notification to Slack
        send_slack_notification(f"Error in Node Monitoring Script: {e}")

if __name__ == "__main__":
    # Load environment variables from config.env
    load_dotenv(dotenv_path='config.env')
    check_node_resources()

Step 4: Creating the `config.env` File

Create a file named `config.env` in the same directory as your Python script. This is where we’ll store our Slack webhook URL. It keeps our secrets out of the code, which is a critical security practice.


SLACK_WEBHOOK_URL="https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX"

Make sure to replace the URL with your actual Slack incoming webhook URL.
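
With the config in place, run the script once by hand to confirm connectivity and permissions before automating anything. You should see one line per node (node names and percentages below are illustrative):

python3 node_monitor.py
Fetching node metrics...
Node: worker-1 | CPU: 42.13% | Memory: 61.20%
Node: worker-2 | CPU: 88.70% | Memory: 54.02%
Successfully sent notification to Slack.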

Step 5: Scheduling the Script

This script is only useful if it runs automatically. The classic way to do this is with a cron job on a server that has access to your cluster. You could set it to run every 15 minutes, for example.

A sample cron entry would look like this:

`*/15 * * * * cd /path/to/node-monitor && python3 node_monitor.py`

The `cd` matters here: the script loads `config.env` relative to its working directory.

Pro Tip: For a more cloud-native approach, I strongly recommend running this as a Kubernetes `CronJob`. You can package the script into a Docker container, grant it the necessary RBAC permissions via a `ServiceAccount`, and let Kubernetes manage the scheduling. This is far more resilient than relying on a separate monitoring server.
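
Here’s a minimal sketch of what that CronJob could look like. The image, Secret, ServiceAccount, and namespace names are all placeholders you’d adapt; note that `batch/v1` CronJobs require Kubernetes 1.21+.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: node-monitor
  namespace: monitoring
spec:
  schedule: "*/15 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: node-monitor  # bound to the RBAC example under Common Pitfalls
          restartPolicy: Never
          containers:
            - name: node-monitor
              image: your-registry/node-monitor:latest  # placeholder image with the script baked in
              env:
                - name: SLACK_WEBHOOK_URL
                  valueFrom:
                    secretKeyRef:
                      name: node-monitor-secrets  # placeholder Secret
                      key: slack-webhook-url

Injecting the webhook from a Secret means you don’t need to ship `config.env` in the image; the script’s `os.getenv()` call picks the value up either way.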

Common Pitfalls (Where I Usually Mess Up)

  • RBAC Permissions: The first time I set this up, I spent an hour debugging before I realized the service account running my script didn’t have permission to `get` and `list` nodes or access the metrics API. Always check your `ClusterRole` and `ClusterRoleBinding` first (a minimal working example follows this list).
  • Thresholds Too Low: Setting your thresholds too low (e.g., 60%) will lead to alert fatigue. Your team will start ignoring the messages. Start at 85% and adjust based on your cluster’s typical workload patterns.
  • Forgetting About In-Cluster Config: If you run this script inside a pod (like in a Kubernetes CronJob), you need `config.load_incluster_config()` rather than `config.load_kube_config()`. The script above tries the in-cluster config first and falls back to your local kubeconfig, so it covers both cases, but keep the distinction in mind if you borrow only fragments of it.
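
For reference, here’s a minimal set of RBAC manifests that covers everything the script touches. Treat it as a sketch: the names and the `monitoring` namespace are placeholders.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: node-monitor
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-monitor
rules:
  # Read node objects for allocatable capacity
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list"]
  # Read live usage from the metrics API
  - apiGroups: ["metrics.k8s.io"]
    resources: ["nodes"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: node-monitor
subjects:
  - kind: ServiceAccount
    name: node-monitor
    namespace: monitoring
roleRef:
  kind: ClusterRole
  name: node-monitor
  apiGroup: rbac.authorization.k8s.io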

Conclusion

And that’s it. You now have a robust, automated monitoring script that will keep you ahead of resource exhaustion issues on your nodes. This isn’t just about getting alerts; it’s about reclaiming your time and shifting from a reactive to a proactive mindset. Feel free to expand on this script—add logic to check disk pressure, track pod restarts, or integrate with other notification systems. Happy automating!

Darian Vance

Lead Cloud Architect & DevOps Strategist

With over 12 years in system architecture and automation, Darian specializes in simplifying complex cloud infrastructures. An advocate for open-source solutions, he founded TechResolve to provide engineers with actionable, battle-tested troubleshooting guides and robust software alternatives.


🤖 Frequently Asked Questions

❓ How can I automate Kubernetes node resource exhaustion alerts?

You can automate Kubernetes node resource exhaustion alerts using a Python script that queries the `metrics.k8s.io` API for real-time CPU and memory usage, compares it against defined thresholds (e.g., 85%), and sends proactive notifications to a Slack channel via a webhook.

❓ How does this custom script compare to alternative Kubernetes monitoring solutions?

This custom Python script offers a lightweight, open-source, and highly customizable solution specifically for CPU/memory exhaustion alerts, contrasting with more comprehensive, often proprietary, monitoring platforms that provide broader observability but may introduce higher complexity and cost. It’s a targeted, self-managed approach for a specific problem.

❓ What is a common implementation pitfall when setting up Kubernetes node resource monitoring?

A common pitfall is insufficient RBAC permissions. The `ServiceAccount` running the monitoring script must have `get` and `list` permissions for nodes and access to the `metrics.k8s.io` API group to fetch resource usage data. Always verify your `ClusterRole` and `ClusterRoleBinding` configurations.
