🚀 Executive Summary
TL;DR: The widely accepted shell scripting ‘best practice’ `set -e` is often dangerously unreliable, leading to silent production failures due to its erratic behavior and numerous exceptions. Robust alternatives include explicit exit code checks, using `trap` for centralized error handling and cleanup, or deliberately managing failure domains without `set -e`.
🎯 Key Takeaways
- The `set -e` (or `set -o errexit`) option in shell scripts creates a false sense of security, as its behavior is riddled with exceptions where it will ignore command failures.
- Explicitly checking the exit code (`$?`) immediately after critical commands, or using `command || { error_handler; exit 1; }`, provides clear, unambiguous, and predictable error handling.
- The `trap` command, particularly with the `EXIT` and `ERR` pseudo-signals, offers a centralized mechanism for cleanup and notification, so critical logic runs on nearly every exit path (an uncatchable `SIGKILL` being the notable exception).
Senior DevOps Engineer Darian Vance explains why the ‘set -e’ best practice in shell scripts is often dangerously unreliable. Learn robust, real-world alternatives for error handling that won’t silently break your production systems.
That One “Best Practice” I Threw in the Trash: The Truth About ‘set -e’
It was 3 AM. A PagerDuty alert screamed about potential data corruption on prod-db-01. The nightly backup script, a sacred cow nobody had touched in years, had been logging “success” for over a week. But it wasn’t succeeding. The underlying rsync command was failing due to a permissions change on the NAS target, but a tricky interaction with set -e in a command substitution meant the script exited before it could run the failure notification function. It just died silently in the dark. That night, after we manually recovered the backups, I declared a personal war on the blind worship of set -e.
The “Why”: Best Intentions, Dangerous Reality
The gospel handed down on high says set -e (or its long-form alias, set -o errexit) makes your scripts safer. The idea is simple: the script will exit immediately if any command fails. It sounds great, right? Fail fast, prevent chaos. The problem is that the ‘e’ in set -e might as well stand for ‘erratic’, because its behavior is riddled with exceptions that will burn you.
It’s a landmine because it creates a false sense of security. You think you’re protected, but here are just a few common situations where set -e will completely ignore a command failure:
- When the failing command is part of a conditional, like an `if` or `while` statement.
- When the failing command is on the left side of a pipe (`|`), unless you also use `set -o pipefail`.
- When the command’s result is being negated with `!`.
- When a command is part of a list connected by `&&` or `||` (except for the last command).
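To see these exceptions for yourself, here is a minimal sketch that runs each case in a child shell with `set -e` turned on and reports whether the shell got past the failing command. The `run_case` helper and its labels are made up for this illustration. The final case is not in the list above: it is the well-known `local var=$(...)` pattern, where `local` returns success and hides the failed command substitution.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a snippet under `set -e` in a child shell.
# If errexit fires, the trailing echo never runs and the output stays empty.
run_case() {
  local label=$1 snippet=$2 output
  output=$(bash -c "set -e; ${snippet}; echo survived" 2>/dev/null)
  printf '%s: %s\n' "${label}" "${output:-exited}"
}

run_case 'plain failure'      'false'
run_case 'if condition'       'if false; then :; fi'
run_case 'left of a pipe'     'false | true'
run_case 'negated with !'     '! true'
run_case 'inside an && list'  'false && true'
run_case 'local assignment'   'f() { local x=$(false); }; f'
```

Only the first case should report `exited`. Every other case reports `survived`, which means the script kept running after a command failed.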
This inconsistency is the real danger. The cognitive overhead of tracking when it will or won’t trigger is more dangerous than just handling your errors deliberately. Let’s talk about how to do that.
Solution 1: The Paranoid Check (Quick & Clear)
Instead of relying on global, unpredictable magic, be explicit where it matters most. For any command that absolutely must succeed for the script to be valid, check its exit code ($?) immediately after it runs. It’s more verbose, but it’s as clear as a bell and has zero ambiguity.
Consider this script snippet:
# The "I hope set -e works" method
set -e
pg_dump -U postgres -h prod-db-01 my_database > /mnt/backups/db.sql
tar -czf /mnt/backups/db-backup-$(date +%F).tar.gz /mnt/backups/db.sql
A much safer, more explicit version looks like this:
# The explicit, "I trust nothing" method
pg_dump -U postgres -h prod-db-01 my_database > /mnt/backups/db.sql
if [ $? -ne 0 ]; then
    echo "FATAL: pg_dump failed for prod-db-01! Aborting."
    # send_pagerduty_alert "DB Backup Failed"
    exit 1
fi
tar -czf /mnt/backups/db-backup-$(date +%F).tar.gz /mnt/backups/db.sql
if [ $? -ne 0 ]; then
    echo "FATAL: tar command failed! Incomplete backup artifact."
    # send_pagerduty_alert "Backup Compression Failed"
    exit 1
fi
echo "Backup completed successfully."
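The word "immediately" matters here, because `$?` always holds the status of the most recent command. The sketch below uses `false` as a stand-in for a failing critical command; the function names are invented for the example. It shows how a single logging line between the command and the check wipes out the failure, and how saving the status first avoids that.

```bash
#!/usr/bin/env bash
# Checking $? too late: the echo in between succeeds and overwrites the status.
check_late() {
  false                                 # stand-in for a failing critical command
  echo "logging something" > /dev/null  # succeeds, so $? is now 0
  echo "late check sees: $?"
}

# Saving $? first: the failure survives any commands that follow.
check_saved() {
  false
  local status=$?                       # capture before anything else runs
  echo "logging something" > /dev/null
  echo "saved check sees: ${status}"
}

check_late
check_saved
```

If you need to log before you bail out, capture the status into a variable on the very next line and test the variable instead.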
Pro Tip: You can shorten the check with a boolean operator: `command || { echo "It failed!"; exit 1; }`. It’s concise and achieves the same explicit check.
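If you find yourself typing that one-liner over and over, you can wrap it in a pair of small functions. This is only a sketch: `die` and `must` are hypothetical names, not part of any standard tool, and the demo commands are placeholders for your real ones.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a message to stderr and exit with the given status.
die() {
  local status=$1
  shift
  echo "FATAL: $*" >&2
  exit "${status}"
}

# Hypothetical wrapper: run a command and abort with its exit code if it fails.
must() {
  "$@" || die "$?" "command failed: $*"
}

# Demo in subshells so the failing case does not end this script.
( must true && echo "true succeeded, still running" )
( must false )
echo "subshell exited with status $?"
```

In a real script you would call it directly, for example `must tar -czf "$archive" "$source"`, and let the failure end the script. Note that `must` preserves the original exit code rather than flattening everything to 1, which helps when you read the logs later.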
Solution 2: The Grown-Up Script with Traps
For any script that isn’t just a simple sequence of commands, you need real error handling. This is where trap comes in. A trap is a command that executes when your script receives a certain signal, or when Bash raises one of its pseudo-signals. We can set a trap to run a function on ERR (a command fails, subject to the same exceptions as set -e) or EXIT (the script finishes for any reason).
This approach gives you a centralized place to handle cleanup (like removing temp files) and sending notifications, making your scripts incredibly robust.
#!/bin/bash
set -o nounset # Treat unset variables as an error
set -o pipefail # This one is actually useful and I recommend it
# Cleanup runs on EXIT, so it fires exactly once however the script ends
function cleanup {
    local exit_code=$?
    echo "---"
    echo "Executing cleanup..."
    rm -f /tmp/backup.lock
    if [ "${exit_code}" -ne 0 ]; then
        echo "SCRIPT FAILED with exit code ${exit_code}"
    else
        echo "Script finished successfully."
    fi
}
# Without set -e, an ERR trap only runs its handler and then the script
# carries on, so the handler has to exit explicitly to stop the script
function on_error {
    local exit_code=$?
    echo "ERROR: command failed with exit code ${exit_code}" >&2
    exit "${exit_code}"
}
# Set the traps: 'on_error' aborts on a failed command, 'cleanup' runs on every exit
trap cleanup EXIT
trap on_error ERR
echo "Creating lockfile..."
touch /tmp/backup.lock || { echo "Failed to create lockfile"; exit 1; }
echo "Running a command that will succeed..."
ls -l / > /dev/null
echo "Running a command that will fail..."
# This triggers the ERR trap; on_error exits, which in turn fires the EXIT trap
grep "this-pattern-does-not-exist" /etc/hostname
# This part of the script will never be reached
echo "This line is unreachable because the script will exit."
Using trap is my default for any automation that runs in production. Short of an uncatchable SIGKILL, it ensures that my logging and cleanup logic will run, no matter where or why the script fails.
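One caveat worth knowing before you lean on ERR: by default, Bash does not pass an ERR trap down into shell functions, command substitutions, or subshells. You have to opt in with `set -o errtrace` (short form `set -E`). The sketch below runs the same snippet in two child shells to show the difference; the `inner` function is made up for the demo.

```bash
#!/usr/bin/env bash
# The same snippet is run twice: once as-is, once with errtrace enabled.
snippet='
trap "echo ERR trap fired" ERR
inner() {
  false                  # fails inside a function
  echo "inner finished"  # keeps the function return status at 0
}
inner
'

without_errtrace=$(bash -c "${snippet}")
with_errtrace=$(bash -c "set -o errtrace; ${snippet}")

printf 'without errtrace:\n%s\n\n' "${without_errtrace}"
printf 'with errtrace:\n%s\n' "${with_errtrace}"
```

Without errtrace, the failure inside `inner` goes unnoticed and only `inner finished` is printed. With it, the trap fires first. If most of your logic lives in functions, enabling errtrace alongside the ERR trap is probably what you want.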
Solution 3: The “Heresy” Option — Just Stop Using It
Alright, here’s the hot take I brought back from that 3 AM incident. My team’s new standard is to not use set -e by default in our scripts. It’s heresy to some, but it’s pragmatic.
We replaced a magical, unreliable safety net with a simple, conscious engineering decision:
- For each command, we ask: “What happens if this fails?”
- If the answer is “nothing important,” we let it be. A failed `rm` of a temp file that might not exist is fine.
- If the answer is “the rest of the script is pointless or dangerous,” we add an explicit check right after it (like in Solution 1).
This forces us to think about our failure domains. It’s more work up front, but it completely eliminates the category of “I thought set -e would save me” bugs. We trade a false sense of security for intentional, predictable code.
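Here is roughly what that decision looks like in a script, as a sketch with made-up names (`run_job`, `stale.tmp`, `staging`). Each command is either allowed to fail on purpose or guarded with an explicit check, and the comments record which one it is. The function uses `return` instead of `exit` so the caller decides what to do next.

```bash
#!/usr/bin/env bash
# Hypothetical job: file and directory names are illustrative only.
run_job() {
  local work_dir=$1

  # Harmless if it fails: the stale file may not exist, so ignore it on purpose.
  rm "${work_dir}/stale.tmp" 2>/dev/null || true

  # Fatal if it fails: everything below depends on the staging directory.
  mkdir "${work_dir}/staging" || {
    echo "FATAL: cannot create staging dir" >&2
    return 1
  }

  # Fatal if it fails: an empty payload would be a silently broken artifact.
  echo "payload" > "${work_dir}/staging/data.txt" || {
    echo "FATAL: cannot write payload" >&2
    return 1
  }

  echo "job finished"
}

work_dir=$(mktemp -d)
run_job "${work_dir}"
rm -rf "${work_dir}"
```

The explicit `|| true` is doing real work here: it tells the next reader that ignoring the failure was a choice, not an oversight.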
Warning: This is not an excuse to write sloppy code! It’s a call to be more deliberate about your error handling, rather than outsourcing that thinking to a global setting that can and will betray you when you least expect it.
Stop treating “best practices” as unbreakable laws. Understand the why behind them. In the case of set -e, the original goal was to prevent silent failures. Ironically, its own unpredictable behavior is one of the biggest causes of them. Be explicit, be deliberate, and you’ll sleep better at night. Trust me.
🤖 Frequently Asked Questions
❓ What are the main issues with relying on `set -e` in shell scripts?
`set -e` is unreliable because it has numerous exceptions where it won’t exit on failure, such as when a command is part of a conditional (`if`/`while`), on the left side of a pipe (without `pipefail`), when negated with `!`, or within `&&`/`||` lists (except the last command). This inconsistency leads to silent failures and a false sense of security.
❓ How do explicit error checks compare to `set -e` for ensuring script reliability?
Explicit error checks using `if [ $? -ne 0 ]` or `command || { … }` are more verbose but offer clear, predictable, and unambiguous error handling. In contrast, `set -e` is concise but its erratic behavior and numerous exceptions make it dangerously unreliable, often causing silent failures that are difficult to diagnose.
❓ What is a common pitfall when trying to ensure script reliability, and how can `trap` help?
A common pitfall is assuming `set -e` will reliably catch all command failures, leading to silent issues. The `trap` command helps by letting you register handler functions on Bash pseudo-signals such as `ERR` (a command fails) and `EXIT` (the script finishes), so that cleanup, logging, and notification logic runs consistently across the script’s exit paths.