🚀 Executive Summary
TL;DR: This guide provides a straightforward Python script to automate Kubernetes node resource exhaustion alerts. It addresses the problem of manually checking node health by proactively sending Slack notifications when CPU or memory usage exceeds predefined thresholds, shifting from reactive to proactive cluster management.
🎯 Key Takeaways
- The solution utilizes the official Kubernetes Python client to connect to the cluster API and queries the `metrics.k8s.io` API for real-time CPU and memory usage.
- Resource usage is calculated as a percentage of the node’s `allocatable` capacity, providing a more accurate measure of node pressure than raw usage values.
- Secure configuration management is achieved using `python-dotenv` to load sensitive information like Slack webhook URLs from a `config.env` file, preventing hardcoding of secrets.
- The script is designed for automated scheduling, with a strong recommendation for deployment as a Kubernetes `CronJob` using a Docker container and appropriate RBAC permissions for a cloud-native, resilient approach.
- Common pitfalls include incorrect RBAC permissions for accessing node metrics, setting thresholds too low leading to alert fatigue, and not distinguishing between `config.load_kube_config()` for external access and `config.load_incluster_config()` for in-cluster deployments.
Track Kubernetes Node Resource Exhaustion (CPU/Memory) Alerts
Hey there, Darian Vance here. Let’s talk about something that used to be a real time-sink for me: manually checking node health. I’d run `kubectl top nodes`, scan the list, and try to catch hotspots before they became infernos. It felt productive, but in reality I was wasting at least a couple of hours a week on something that could easily be automated. After an application went down because a node ran out of memory, I knew I had to build a better system. This guide is that system—a straightforward way to get proactive alerts in Slack when your Kubernetes nodes are under pressure.
Prerequisites
Before we dive in, make sure you have the following ready to go. This will make the process much smoother.
- Access to a Kubernetes cluster with your `kubeconfig` file correctly set up.
- Permissions to list nodes and get node metrics. You’ll need access to the `metrics.k8s.io` API group.
- A Python 3 environment.
- A Slack workspace where you can create an incoming webhook.
The Guide: Step-by-Step
Step 1: Setting Up Your Environment
First things first, let’s get our Python environment ready. I’ll skip the standard virtualenv setup since you likely have your own workflow for that. Let’s jump straight to the dependencies. You’ll need to install a few key libraries using pip. The main ones are `kubernetes` (the official Python client), `requests` (for sending our Slack message), and `python-dotenv` (to handle our configuration securely).
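If you like to keep dependencies declared alongside the project, the three packages can go in a `requirements.txt` (left unpinned here; pin versions as your workflow dictates):

```
kubernetes
requests
python-dotenv
```

Install them with `pip install -r requirements.txt`, or simply `pip install kubernetes requests python-dotenv`.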
Step 2: The Python Script Logic
The goal here is simple: query the Kubernetes API for node metrics, check them against our defined thresholds, and fire an alert if anything looks problematic. Here’s the breakdown of what our script will do:
- Load Configuration: It will securely load the Kubernetes config from your environment and our Slack webhook URL from a `config.env` file. We never hardcode secrets.
- Connect to the API: Using the Python client, it establishes a connection to your cluster’s API server.
- Fetch Node Metrics: It queries the `metrics.k8s.io` API to get real-time CPU and memory usage for every node.
- Check Thresholds: It iterates through each node, comparing its current resource usage against the CPU and Memory thresholds we set. I’ve set them at 85% in the example, which is a reasonable starting point.
- Send Alert: If a node exceeds a threshold, the script formats a clear, concise message and sends it to a designated Slack channel using the webhook.
Pro Tip: Don’t just check usage; check capacity too. The script calculates the percentage of resource usage against the node’s total allocatable capacity. This gives you a much more accurate picture of node pressure than just looking at the raw usage values.
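A side note on those values: both usage and allocatable capacity come back as Kubernetes quantity strings (`250m`, `4096Ki`, `2`), so doing percentage math means normalizing units first. Here’s a minimal, self-contained sketch of that conversion (the helper names are my own, and it only covers the common suffixes; the full quantity grammar has a few more):

```python
def parse_cpu(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity string to nanocores."""
    if quantity.endswith("n"):       # nanocores, e.g. "187056028n" (metrics API)
        return float(quantity[:-1])
    if quantity.endswith("u"):       # microcores
        return float(quantity[:-1]) * 1_000
    if quantity.endswith("m"):       # millicores, e.g. "250m"
        return float(quantity[:-1]) * 1_000_000
    return float(quantity) * 1_000_000_000  # whole cores, e.g. "4"


def parse_memory(quantity: str) -> float:
    """Convert a Kubernetes memory quantity string to bytes."""
    # Binary suffixes are checked before decimal ones; dict order matters here.
    units = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
             "K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}
    for suffix, factor in units.items():
        if quantity.endswith(suffix):
            return float(quantity[:-len(suffix)]) * factor
    return float(quantity)  # plain bytes
```

With helpers like these, the percentage is just `parse_cpu(usage) / parse_cpu(allocatable) * 100`, regardless of which suffixes the API happened to use.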
Step 3: The Script
Here is the complete Python script. I’ve added comments to walk you through the important sections. Save this as something like `node_monitor.py`.
```python
import os

import requests
from dotenv import load_dotenv
from kubernetes import client, config


def send_slack_notification(message):
    """Sends a message to a Slack channel using a webhook."""
    webhook_url = os.getenv("SLACK_WEBHOOK_URL")
    if not webhook_url:
        print("Error: SLACK_WEBHOOK_URL environment variable not set.")
        return
    payload = {'text': message}
    try:
        response = requests.post(webhook_url, json=payload)
        response.raise_for_status()
        print("Successfully sent notification to Slack.")
    except requests.exceptions.RequestException as e:
        print(f"Error sending Slack notification: {e}")


def check_node_resources():
    """
    Checks the CPU and Memory usage of each node in the cluster
    and sends a Slack alert if usage exceeds defined thresholds.
    """
    # Define resource usage thresholds
    CPU_THRESHOLD_PERCENT = 85.0
    MEMORY_THRESHOLD_PERCENT = 85.0

    try:
        # Load Kubernetes configuration from the default location (~/.kube/config).
        # Use config.load_incluster_config() instead when running inside a pod.
        config.load_kube_config()

        # Create API clients
        core_v1 = client.CoreV1Api()
        custom_objects_api = client.CustomObjectsApi()

        print("Fetching node metrics...")
        # Get node metrics from the metrics.k8s.io API (requires metrics-server)
        node_metrics = custom_objects_api.list_cluster_custom_object(
            "metrics.k8s.io", "v1beta1", "nodes"
        )
        nodes = core_v1.list_node().items

        for stats in node_metrics['items']:
            node_name = stats['metadata']['name']
            # CPU usage (reported in nanocores, e.g. "187056028n")
            cpu_usage_str = stats['usage']['cpu']
            # Memory usage (reported in kibibytes, e.g. "3999544Ki")
            mem_usage_str = stats['usage']['memory']

            # Find the corresponding node spec for capacity
            node_spec = next((n for n in nodes if n.metadata.name == node_name), None)
            if not node_spec:
                continue

            # CPU capacity (in cores or millicores, needs conversion to nanocores)
            cpu_capacity_str = node_spec.status.allocatable['cpu']
            # Memory capacity (in kibibytes)
            mem_capacity_str = node_spec.status.allocatable['memory']

            # --- Data cleaning and conversion ---
            # Convert CPU usage from 'n' (nanocores) to a float
            cpu_usage_val = float(cpu_usage_str.rstrip('n'))
            # Convert memory usage from 'Ki' to a float
            mem_usage_val = float(mem_usage_str.rstrip('Ki'))

            # Convert CPU capacity to nanocores
            if 'm' in cpu_capacity_str:  # millicores
                cpu_capacity_val = float(cpu_capacity_str.rstrip('m')) * 1_000_000
            else:  # full cores
                cpu_capacity_val = float(cpu_capacity_str) * 1_000_000_000
            # Convert memory capacity from 'Ki' to a float
            mem_capacity_val = float(mem_capacity_str.rstrip('Ki'))

            # --- Percentage calculation ---
            cpu_percent = (cpu_usage_val / cpu_capacity_val) * 100
            mem_percent = (mem_usage_val / mem_capacity_val) * 100

            print(f"Node: {node_name} | CPU: {cpu_percent:.2f}% | Memory: {mem_percent:.2f}%")

            # --- Alerting logic ---
            if cpu_percent > CPU_THRESHOLD_PERCENT:
                message = (f":alert: *High CPU Alert!*\n"
                           f"*Node:* `{node_name}`\n"
                           f"*Usage:* `{cpu_percent:.2f}%` (Threshold: `{CPU_THRESHOLD_PERCENT}%`)")
                send_slack_notification(message)

            if mem_percent > MEMORY_THRESHOLD_PERCENT:
                message = (f":alert: *High Memory Alert!*\n"
                           f"*Node:* `{node_name}`\n"
                           f"*Usage:* `{mem_percent:.2f}%` (Threshold: `{MEMORY_THRESHOLD_PERCENT}%`)")
                send_slack_notification(message)

    except Exception as e:
        print(f"An error occurred: {e}")
        # Optionally send an error notification to Slack
        send_slack_notification(f"Error in Node Monitoring Script: {e}")


if __name__ == "__main__":
    # Load environment variables from config.env
    load_dotenv(dotenv_path='config.env')
    check_node_resources()
```
Step 4: Creating the `config.env` File
Create a file named `config.env` in the same directory as your Python script. This is where we’ll store our Slack webhook URL. It keeps our secrets out of the code, which is a critical security practice.
```
SLACK_WEBHOOK_URL="https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX"
```
Make sure to replace the URL with your actual Slack incoming webhook URL.
Step 5: Scheduling the Script
This script is only useful if it runs automatically. The classic way to do this is with a cron job on a server that has access to your cluster. You could set it to run every 15 minutes, for example.
A sample cron entry would look like this:
`*/15 * * * * cd /path/to/project && python3 node_monitor.py`

(The `cd` matters because the script loads `config.env` relative to its working directory.)
Pro Tip: For a more cloud-native approach, I strongly recommend running this as a Kubernetes `CronJob`. You can package the script into a Docker container, grant it the necessary RBAC permissions via a `ServiceAccount`, and let Kubernetes manage the scheduling. This is far more resilient than relying on a separate monitoring server.
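As a rough sketch, that `CronJob` manifest might look like the following. The image name, `ServiceAccount`, and Secret are all placeholders you would swap for your own:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: node-monitor
spec:
  schedule: "*/15 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: node-monitor  # bound to a ClusterRole that can read nodes and node metrics
          restartPolicy: OnFailure
          containers:
            - name: node-monitor
              image: your-registry/node-monitor:latest  # placeholder image containing the script
              env:
                - name: SLACK_WEBHOOK_URL
                  valueFrom:
                    secretKeyRef:
                      name: node-monitor-secrets  # hypothetical Secret holding the webhook URL
                      key: slack-webhook-url
```

Storing the webhook in a Secret keeps it out of both the image and the manifest, mirroring what `config.env` does locally.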
Common Pitfalls (Where I Usually Mess Up)
- RBAC Permissions: The first time I set this up, I spent an hour debugging before I realized the service account running my script didn’t have permission to `get` and `list` nodes or access the metrics API. Always check your `ClusterRole` and `ClusterRoleBinding` first.
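For reference, a minimal `ClusterRole`/`ClusterRoleBinding` pair for this script might look like this (the `ServiceAccount` name and namespace are placeholders):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-metrics-reader
rules:
  - apiGroups: [""]             # core API group: the nodes themselves
    resources: ["nodes"]
    verbs: ["get", "list"]
  - apiGroups: ["metrics.k8s.io"]  # the node usage metrics
    resources: ["nodes"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: node-metrics-reader
subjects:
  - kind: ServiceAccount
    name: node-monitor        # placeholder ServiceAccount
    namespace: monitoring     # placeholder namespace
roleRef:
  kind: ClusterRole
  name: node-metrics-reader
  apiGroup: rbac.authorization.k8s.io
```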
- Thresholds Too Low: Setting your thresholds too low (e.g., 60%) will lead to alert fatigue. Your team will start ignoring the messages. Start at 85% and adjust based on your cluster’s typical workload patterns.
- Forgetting About In-Cluster Config: If you run this script inside a pod (like in a Kubernetes CronJob), you should use `config.load_incluster_config()` instead of `config.load_kube_config()`. The example script handles the standard case, but it’s a key distinction for in-cluster deployment.
Conclusion
And that’s it. You now have a robust, automated monitoring script that will keep you ahead of resource exhaustion issues on your nodes. This isn’t just about getting alerts; it’s about reclaiming your time and shifting from a reactive to a proactive mindset. Feel free to expand on this script—add logic to check disk pressure, track pod restarts, or integrate with other notification systems. Happy automating!
🤖 Frequently Asked Questions
❓ How can I automate Kubernetes node resource exhaustion alerts?
You can automate Kubernetes node resource exhaustion alerts using a Python script that queries the `metrics.k8s.io` API for real-time CPU and memory usage, compares it against defined thresholds (e.g., 85%), and sends proactive notifications to a Slack channel via a webhook.
❓ How does this custom script compare to alternative Kubernetes monitoring solutions?
This custom Python script offers a lightweight, open-source, and highly customizable solution specifically for CPU/memory exhaustion alerts, contrasting with more comprehensive, often proprietary, monitoring platforms that provide broader observability but may introduce higher complexity and cost. It’s a targeted, self-managed approach for a specific problem.
❓ What is a common implementation pitfall when setting up Kubernetes node resource monitoring?
A common pitfall is insufficient RBAC permissions. The `ServiceAccount` running the monitoring script must have `get` and `list` permissions for nodes and access to the `metrics.k8s.io` API group to fetch resource usage data. Always verify your `ClusterRole` and `ClusterRoleBinding` configurations.