🚀 Executive Summary

TL;DR: The widely accepted shell scripting ‘best practice’ `set -e` is often dangerously unreliable, leading to silent production failures due to its erratic behavior and numerous exceptions. Robust alternatives include explicit exit code checks, using `trap` for centralized error handling and cleanup, or deliberately managing failure domains without `set -e`.

🎯 Key Takeaways

  • The `set -e` (or `set -o errexit`) option in shell scripts creates a false sense of security, as its behavior is riddled with exceptions where it will ignore command failures.
  • Explicitly checking the exit code (`$?`) immediately after critical commands, or using `command || { error_handler; exit 1; }`, provides clear, unambiguous, and predictable error handling.
  • The `trap` command, particularly with the `EXIT` and (in Bash) `ERR` pseudo-signals, offers a robust, centralized mechanism for cleanup and notification, so critical logic runs wherever the script fails.

Senior DevOps Engineer Darian Vance explains why the ‘set -e’ best practice in shell scripts is often dangerously unreliable. Learn robust, real-world alternatives for error handling that won’t silently break your production systems.

That One “Best Practice” I Threw in the Trash: The Truth About ‘set -e’

It was 3 AM. A PagerDuty alert screamed about potential data corruption on prod-db-01. The nightly backup script, a sacred cow nobody had touched in years, had been logging “success” for over a week. But it wasn’t succeeding. The underlying rsync command was failing due to a permissions change on the NAS target, but a tricky interaction with set -e in a command substitution meant the script exited before it could run the failure notification function. It just died silently in the dark. That night, after we manually recovered the backups, I declared a personal war on the blind worship of set -e.

The “Why”: Best Intentions, Dangerous Reality

The gospel handed down on high says set -e (or its long-form alias, set -o errexit) makes your scripts safer. The idea is simple: the script will exit immediately if any command fails. It sounds great, right? Fail fast, prevent chaos. The problem is that the ‘e’ in set -e might as well stand for ‘erratic’, because its behavior is riddled with exceptions that will burn you.

It’s a landmine because it creates a false sense of security. You think you’re protected, but here are just a few common situations where set -e will completely ignore a command failure:

  • When the failing command is part of a conditional, like an if or while statement.
  • When the failing command is on the left side of a pipe (|), unless you also use set -o pipefail.
  • When the command’s result is being negated with !.
  • When a command is part of a list connected by && or || (except for the last command).
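
If you want to see these exceptions for yourself, here is a minimal sketch. It uses `false` and `true` as stand-ins for real commands, and the `survived` counter is just an illustration device I made up for the demo, not something from the backup script in the story.

```shell
#!/bin/bash
# Sketch: every numbered step contains a failure, yet errexit
# never fires and the script runs to the end.
set -e
survived=0

# 1. Failure in the condition of an if statement
if false; then
  echo "never printed"
fi
survived=$((survived + 1))

# 2. Failure on the left side of a pipe (pipefail is off)
false | cat
survived=$((survived + 1))

# 3. Non-zero status produced through negation with !
! true
survived=$((survived + 1))

# 4. Failure of a non-final command in an && list
false && echo "never printed"
survived=$((survived + 1))

echo "Survived ${survived} of 4 failures with set -e enabled"
```

Run it and the last line reports that all four failures were survived, even though set -e was switched on at the top.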

This inconsistency is the real danger. The cognitive overhead of tracking when it will or won’t trigger is more dangerous than just handling your errors deliberately. Let’s talk about how to do that.

Solution 1: The Paranoid Check (Quick & Clear)

Instead of relying on global, unpredictable magic, be explicit where it matters most. For any command that absolutely must succeed for the script to be valid, check its exit code ($?) immediately after it runs. It’s more verbose, but it’s as clear as a bell and has zero ambiguity.

Consider this script snippet:

# The "I hope set -e works" method
set -e
pg_dump -U postgres -h prod-db-01 my_database > /mnt/backups/db.sql
tar -czf /mnt/backups/db-backup-$(date +%F).tar.gz /mnt/backups/db.sql

A much safer, more explicit version looks like this:

# The explicit, "I trust nothing" method
pg_dump -U postgres -h prod-db-01 my_database > /mnt/backups/db.sql
if [ $? -ne 0 ]; then
  # Error messages go to stderr so they don't get lost in normal output
  echo "FATAL: pg_dump failed for prod-db-01! Aborting." >&2
  # send_pagerduty_alert "DB Backup Failed"
  exit 1
fi

tar -czf /mnt/backups/db-backup-$(date +%F).tar.gz /mnt/backups/db.sql
if [ $? -ne 0 ]; then
  echo "FATAL: tar command failed! Incomplete backup artifact." >&2
  # send_pagerduty_alert "Backup Compression Failed"
  exit 1
fi

echo "Backup completed successfully."

Pro Tip: You can shorten the check with a boolean operator: command || { echo "It failed!"; exit 1; }. It’s concise and achieves the same explicit check.
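
One way to keep that short form tidy is to wrap the message and the exit in a small helper. This is only a sketch: `die` and `critical_step` are hypothetical names I'm using for illustration, and the guarded command runs in a subshell here purely so the demo can inspect the result instead of killing the shell that runs it.

```shell
#!/bin/bash
# die: hypothetical helper that reports to stderr and aborts.
die() {
  echo "FATAL: $*" >&2
  exit 1
}

# Stand-in for a critical command; `false` always fails.
critical_step() { false; }

# Guard the command with ||. The subshell contains the exit so this demo
# can capture the message and status rather than terminating.
message=$( (critical_step || die "critical_step failed") 2>&1 )
demo_status=$?

echo "message: ${message}"
echo "status:  ${demo_status}"
```

In a real script you would drop the subshell and simply write critical_step || die "critical_step failed".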

Solution 2: The Grown-Up Script with Traps

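For any script that isn’t just a simple sequence of commands, you need real error handling. This is where trap comes in. A trap registers a command that runs when your script receives a signal or hits one of the shell’s pseudo-signals. We can set a trap to run a handler on ERR (a Bash pseudo-signal that fires when a command fails) or EXIT (the script finishes for any reason). Two caveats: the ERR trap is skipped in the same situations where set -e is skipped (conditionals, && and || lists, negation), and on its own it only runs the handler; the script keeps going unless the handler calls exit.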

This approach gives you a centralized place to handle cleanup (like removing temp files) and sending notifications, making your scripts incredibly robust.

#!/bin/bash
set -o nounset  # Treat unset variables as an error
set -o pipefail # This one is actually useful and I recommend it

# Define a cleanup function
function cleanup {
  local exit_code=$?
  echo "---"
  echo "Executing cleanup..."
  rm -f /tmp/backup.lock

  if [ ${exit_code} -ne 0 ]; then
    echo "SCRIPT FAILED with exit code ${exit_code}"
  else
    echo "Script finished successfully."
  fi
}

# Set the traps. ERR only runs its handler and does not stop the script by
# itself, so the handler exits explicitly; that exit then fires the EXIT trap.
trap cleanup EXIT
trap 'exit $?' ERR

echo "Creating lockfile..."
touch /tmp/backup.lock || { echo "Failed to create lockfile"; exit 1; }

echo "Running a command that will succeed..."
ls -l / > /dev/null

echo "Running a command that will fail..."
# This triggers the ERR trap, which exits and so runs cleanup via the EXIT trap
grep "this-pattern-does-not-exist" /etc/hostname

# This part of the script will never be reached
echo "This line is unreachable because the script will exit."

Using trap is my default for any automation that runs in production. The EXIT trap means my logging and cleanup logic runs wherever the script fails, short of something uncatchable like kill -9.
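
One detail worth verifying yourself: in Bash, an ERR trap on its own runs its handler and then lets the script carry on. The sketch below compares a handler that only logs with one that also exits. Each case runs in its own bash process, and the marker strings `handler-ran` and `still-running` are made up for the demo.

```shell
#!/bin/bash
# Sketch: compare an ERR trap that only logs with one that also exits.
# Each case runs in a separate bash process so the outputs can be compared.

log_only=$(bash -c '
  trap "echo handler-ran" ERR
  false
  echo still-running
')

log_and_exit=$(bash -c '
  trap "echo handler-ran; exit 1" ERR
  false
  echo still-running
')

echo "--- handler only logs ---"
echo "${log_only}"
echo "--- handler logs and exits ---"
echo "${log_and_exit}"
```

Only the second variant stops at the failing command, which is why an ERR handler that is meant to abort the script needs its own exit.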

Solution 3: The “Heresy” Option — Just Stop Using It

Alright, here’s the hot take I brought back from that 3 AM incident. My team’s new standard is to not use set -e by default in our scripts. It’s heresy to some, but it’s pragmatic.

We replaced a magical, unreliable safety net with a simple, conscious engineering decision:

  1. For each command, we ask: “What happens if this fails?”
  2. If the answer is “nothing important,” we let it be. A failed rm of a temp file that might not exist is fine.
  3. If the answer is “the rest of the script is pointless or dangerous,” we add an explicit check right after it (like in Solution 1).

This forces us to think about our failure domains. It’s more work up front, but it completely eliminates the category of “I thought set -e would save me” bugs. We trade a false sense of security for intentional, predictable code.
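
Here is roughly what that decision looks like in a script. Treat it as a sketch under assumptions: the temp directory, `stale.tmp`, and `payload.txt` are invented for the example, and I'm using the `if ! command` form, which is equivalent to checking `$?` right after the command.

```shell
#!/bin/bash
# Sketch: one tolerated failure and one guarded command, chosen deliberately.
# The temp directory and file names are made up for this example.
workdir=$(mktemp -d) || { echo "FATAL: could not create a work directory" >&2; exit 1; }

# "Nothing important": the stale file may not exist, and that is fine.
rm "${workdir}/stale.tmp" 2>/dev/null
tolerated_status=$?
echo "rm exited with ${tolerated_status}; carrying on by design"

# "The rest is pointless or dangerous": later steps need this file, so check it.
if ! printf 'payload\n' > "${workdir}/payload.txt"; then
  echo "FATAL: could not write payload file" >&2
  rm -rf "${workdir}"
  exit 1
fi
payload=$(cat "${workdir}/payload.txt")
echo "payload file contains: ${payload}"

# Tidy up the demo directory
rm -rf "${workdir}"
```

The point is that both outcomes are visible in the code: the ignored failure is ignored on purpose, and the guarded one cannot slip through.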

Warning: This is not an excuse to write sloppy code! It’s a call to be more deliberate about your error handling, rather than outsourcing that thinking to a global setting that can and will betray you when you least expect it.

Stop treating “best practices” as unbreakable laws. Understand the why behind them. In the case of set -e, the original goal was to prevent silent failures. Ironically, its own unpredictable behavior is one of the biggest causes of them. Be explicit, be deliberate, and you’ll sleep better at night. Trust me.

Darian Vance

Lead Cloud Architect & DevOps Strategist

With over 12 years in system architecture and automation, Darian specializes in simplifying complex cloud infrastructures. An advocate for open-source solutions, he founded TechResolve to provide engineers with actionable, battle-tested troubleshooting guides and robust software alternatives.


🤖 Frequently Asked Questions

❓ What are the main issues with relying on `set -e` in shell scripts?

`set -e` is unreliable because it has numerous exceptions where it won’t exit on failure, such as when a command is part of a conditional (`if`/`while`), on the left side of a pipe (without `pipefail`), when negated with `!`, or within `&&`/`||` lists (except the last command). This inconsistency leads to silent failures and a false sense of security.

❓ How do explicit error checks compare to `set -e` for ensuring script reliability?

Explicit error checks using `if [ $? -ne 0 ]` or `command || { … }` are more verbose but offer clear, predictable, and unambiguous error handling. In contrast, `set -e` is concise but its erratic behavior and numerous exceptions make it dangerously unreliable, often causing silent failures that are difficult to diagnose.

❓ What is a common pitfall when trying to ensure script reliability, and how can `trap` help?

A common pitfall is assuming `set -e` will reliably catch all command failures, leading to silent issues. The `trap` command helps by letting you register a `cleanup` function that runs on `EXIT` (the script finishes for any reason) and a handler that runs on `ERR` (a command fails, subject to the same exceptions as `set -e`), so cleanup, logging, and notification logic runs consistently, regardless of the script’s exit path.
