🚀 Executive Summary
TL;DR: Long-running automation scripts often terminate unexpectedly when their parent SSH session closes due to the SIGHUP signal. This article explains the ‘why’ behind this behavior and provides three solutions: `nohup` for quick fixes, `systemd` for robust service management, and `systemd-run` for transient, managed processes.
🎯 Key Takeaways
- SIGHUP is sent to processes started from an SSH session when that session terminates; unless a process handles or ignores the signal, the default action is to terminate it.
- `nohup` combined with `&` provides a quick, fire-and-forget method to detach a process from the terminal and make it immune to SIGHUP, but lacks state tracking and monitoring.
- Defining scripts as `systemd` services is the modern, idempotent, and robust solution, managing processes via PID 1 with proper logging and lifecycle control.
- `systemd-run` allows running a command as a transient `systemd` service on the fly, offering the benefits of `systemd` management without requiring a permanent unit file for dynamic or emergency tasks.
Tired of your automation scripts dying unexpectedly? A Senior DevOps Engineer explains why your long-running processes get killed and provides three real-world solutions to fix it for good.
So, Your Automation Just Ghosted You? Let’s Talk About SIGHUP.
I’ll never forget it. 2 AM, a critical database migration, and I’m watching the Ansible playbook run from our Jenkins controller. The task was simple: kick off a heavy data processing script on prod-db-cluster-01. The playbook logs showed the script started, and then… the Jenkins job finished. “Success!” it said. I went to grab a coffee, feeling pretty good. Twenty minutes later, monitoring alerts started screaming. The new application tables were half-empty, the data was inconsistent, and the old system had already been taken offline. The script had died silently the moment the Ansible SSH session closed. That was the night I learned a hard lesson about the operating system’s ruthless efficiency and the treachery of the SIGHUP signal.
The “Why”: Your Script Isn’t a Rebel, It’s an Orphan
When you run a command over SSH, whether it’s you typing in a terminal or an Ansible playbook executing a task, you create a session. The shell at the head of that session becomes the “parent” of any process you start. The problem is, when that session ends (your SSH client disconnects, the Ansible task finishes), the operating system does some housekeeping. It sends a “hang-up” signal, or SIGHUP, to that parent process.
By default, that signal is passed on to all of its “child” processes. Your long-running script is a child process. It gets the SIGHUP signal and, unless it’s specifically programmed to handle or ignore it, it does what it’s told: it terminates. It’s not a bug; it’s a feature designed to prevent orphaned processes from running forever. But for us, it’s a massive headache.
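You can watch this default behavior without closing a single SSH session. The sketch below is only an illustration: `sleep` stands in for a real long-running script, and the signal is sent by hand with `kill -HUP` rather than by a disconnecting terminal.

```shell
#!/usr/bin/env bash
# Sketch: compare SIGHUP's effect on a plain background process and on one
# started under nohup. `sleep` is a stand-in for a real long-running script.

sleep 10 &                          # default SIGHUP disposition: terminate
plain_pid=$!

nohup sleep 10 >/dev/null 2>&1 &    # nohup sets SIGHUP to "ignore" before exec
nohup_pid=$!

sleep 1                             # give nohup time to start its command
kill -HUP "$plain_pid" "$nohup_pid" # send the hang-up signal to both

# wait returns 128 + the signal number when the child was killed by a signal
plain_status=0
wait "$plain_pid" || plain_status=$?
echo "plain process exit status: $plain_status"

# kill -0 sends no signal; it only checks that the process still exists
if kill -0 "$nohup_pid" 2>/dev/null; then
    nohup_state="still running"
else
    nohup_state="gone"
fi
echo "nohup process: $nohup_state"

# Clean up the survivor so the demo leaves nothing behind
kill "$nohup_pid" 2>/dev/null
wait "$nohup_pid" 2>/dev/null || true
```

SIGHUP is signal number 1, so an exit status of 129 for the plain process means it was killed by the hang-up, while the `nohup` copy carries on.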
Three Ways to Keep Your Scripts Alive
So how do we tell the OS, “Hey, I actually want this one to stick around”? We have a few options, ranging from a quick fix to the architecturally sound solution.
Solution 1: The “Get It Done Now” Fix (nohup & disown)
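This is the classic, old-school sysadmin trick. nohup (no hang-up) is a command that runs another command and makes it immune to the SIGHUP signal. We combine it with an ampersand (&) to run the process in the background. If the job is already running in your current shell, the bash builtin disown helps after the fact: it removes the job from the shell’s job table, so the shell no longer sends it SIGHUP on exit. It’s quick, it’s dirty, and it works for one-off tasks.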
Here’s how you might use it in an Ansible task:
- name: Start long-running data processing script
  ansible.builtin.shell:
    cmd: "nohup /opt/scripts/process_data.sh > /var/log/data_processing.log 2>&1 &"
    chdir: /opt/scripts/
    # Skip this task if the PID file already exists.
    # The script itself has to create this file.
    creates: /var/run/data_processing.pid
Warning: This approach is “fire and forget.” Ansible will start the process and move on. It has no idea if the script inside nohup eventually fails. You lose state and control, making it a poor choice for critical, repeatable automation.
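If you’re stuck with nohup, you can claw back a little visibility by recording the PID at launch and checking it later. This is a minimal sketch, not a monitoring solution: `start_detached` and `is_running` are hypothetical helper names I’ve made up, and the PID and log file paths are whatever you pass in.

```shell
#!/usr/bin/env bash
# Sketch: start a command under nohup, record its PID, and check on it later.
# start_detached and is_running are hypothetical helper names.

# Usage: start_detached PIDFILE LOGFILE COMMAND [ARGS...]
start_detached() {
    local pidfile=$1 logfile=$2
    shift 2
    # Detach stdin as well, so nothing ties the job to the terminal
    nohup "$@" >"$logfile" 2>&1 </dev/null &
    # nohup execs the command, so $! is the PID of the command itself
    echo "$!" >"$pidfile"
}

# Usage: is_running PIDFILE   (exit status 0 if the recorded PID is alive)
is_running() {
    local pidfile=$1 pid
    [ -r "$pidfile" ] || return 1
    pid=$(cat "$pidfile")
    # kill -0 sends no signal; it only tests whether the process exists
    [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null
}
```

With the paths from the Ansible task, that would look like `start_detached /var/run/data_processing.pid /var/log/data_processing.log /opt/scripts/process_data.sh`, followed later by `is_running /var/run/data_processing.pid && echo "still going"`. Even then, you only learn whether a process with that PID exists. The script’s exit status is gone, and PIDs get reused, which is exactly the gap a service manager fills.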
Solution 2: The “Do It Right” Fix (systemd)
If this script is a regular part of your operations, it should be managed like any other service. The modern, idempotent way to handle this is with a service manager like systemd. You’re not just running a script; you’re defining a state for the system to enforce.
The idea is to use your automation tool to define a systemd service unit, and then use the tool to start that service. The process is now managed by the OS itself (specifically, by PID 1), completely detached from your SSH session. It will be properly logged, monitored, and managed.
First, create a service file template (e.g., data-processor.service.j2):
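[Unit]
Description=My Long Running Data Processor

[Service]
# With Type=oneshot, "systemctl start" returns only after the script exits.
# Use Type=simple if the start command should return immediately.
Type=oneshot
RemainAfterExit=true
ExecStart=/opt/scripts/process_data.sh
WorkingDirectory=/opt/scripts/
User=appuser

[Install]
WantedBy=multi-user.target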
Then, use Ansible to deploy and start it:
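- name: Create the systemd service file
  ansible.builtin.template:
    src: data-processor.service.j2
    dest: /etc/systemd/system/data-processor.service
    owner: root
    group: root
    mode: '0644'
  # Requires a handler named "Reload systemd" defined in the play or role
  notify: Reload systemd

- name: Start the data processing service
  ansible.builtin.systemd:
    name: data-processor
    state: started
    # enabled: yes also starts the service on every boot
    enabled: yes
    # Reload unit files first so the new unit is picked up
    daemon_reload: yes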
This is the way. It’s testable, repeatable, and what we expect from proper infrastructure as code.
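Once the service exists, a later SSH session (or a follow-up automation step) can simply ask systemd how the run went. Here’s a minimal sketch, assuming the oneshot unit shown earlier with RemainAfterExit=true. `unit_outcome` and the `SYSTEMCTL` override are hypothetical names for illustration; the override is only there so the function can be exercised without a real systemd.

```shell
#!/usr/bin/env bash
# Sketch: translate systemd's ActiveState into a plain answer for a oneshot
# unit with RemainAfterExit=true. unit_outcome is a hypothetical helper.

# Allow the systemctl binary to be swapped out (for example, in a dry run)
SYSTEMCTL=${SYSTEMCTL:-systemctl}

# Usage: unit_outcome UNIT
unit_outcome() {
    local unit=$1 state
    # --value prints only the property's value, without the "ActiveState=" prefix
    state=$("$SYSTEMCTL" show "$unit" --property=ActiveState --value 2>/dev/null)
    case $state in
        activating) echo "running" ;;      # ExecStart is still executing
        active)     echo "succeeded" ;;    # script exited 0, unit kept active
        failed)     echo "failed" ;;       # non-zero exit, signal, or timeout
        inactive)   echo "not started" ;;
        *)          echo "unknown" ;;
    esac
}

# Example use on the target host:
#   unit_outcome data-processor
#   journalctl --unit=data-processor --no-pager --lines=50
```

The mapping is tied to that unit type. For a Type=simple service, `active` would mean the process is still running, so the `case` branches would need adjusting.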
Solution 3: The “Break Glass” Option (systemd-run)
What if you need the power of systemd without the ceremony of writing a full service file for a one-off task? Meet systemd-run. This command lets you run a process as a transient systemd service, on the fly. It’s incredibly powerful and a great tool for emergencies or dynamic tasks.
By default it creates a transient service unit for your command, which is managed directly by systemd. When your SSH session dies, the process lives on because its parent is PID 1, not your session.
Here’s how you’d run our script with it:
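# Run this directly on the target server or via an Ansible shell task.
# --on-active=30 schedules the start 30 seconds from now via a transient timer;
# omit it to start the script immediately.
systemd-run --unit=my-temp-processor --on-active=30 /opt/scripts/process_data.sh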
You can then check its status like any other service: systemctl status my-temp-processor. It’s the perfect middle ground between the hacky nohup and a full-blown service definition.
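If you find yourself typing the same systemd-run flags repeatedly, a small wrapper keeps them consistent. The sketch below is one possible shape, not a standard tool: `run_transient` and the `SYSTEMD_RUN` override are hypothetical names, and the user and working directory are borrowed from the service file in Solution 2 as assumptions. It starts the command immediately (no timer) and normally needs root.

```shell
#!/usr/bin/env bash
# Sketch: wrap systemd-run so one-off jobs get the same user and working
# directory as the permanent unit. run_transient is a hypothetical helper.

# Allow the systemd-run binary to be swapped out (for example, in a dry run)
SYSTEMD_RUN=${SYSTEMD_RUN:-systemd-run}

# Usage: run_transient UNIT COMMAND [ARGS...]
run_transient() {
    local unit=$1
    shift
    # --unit names the transient service, --uid sets the user it runs as,
    # and --property passes any unit setting, here the working directory.
    "$SYSTEMD_RUN" \
        --unit="$unit" \
        --uid=appuser \
        --property=WorkingDirectory=/opt/scripts \
        "$@"
}

# Example use on the target host:
#   run_transient my-temp-processor /opt/scripts/process_data.sh
#   journalctl --unit=my-temp-processor --no-pager
```

Because the job runs as a regular transient service, its output lands in the journal under the unit name you chose, which makes it easy to find after the SSH session is long gone.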
Comparison at a Glance
| Method | Pros | Cons |
|---|---|---|
| nohup / & | Simple, fast, no dependencies. | Fire-and-forget, hard to monitor, not idempotent. |
| systemd Service | Robust, idempotent, manageable, proper logging. | More setup required (unit file, handler). |
| systemd-run | Powerful, flexible, managed by systemd, no file needed. | More complex syntax, can be overkill for simple tasks. |
Ultimately, the choice depends on your situation. But next time your automation ghosts you, don’t just add a & and hope for the best. Understand the “why,” and choose the tool that fits the job. Your 2 AM self will thank you for it.
🤖 Frequently Asked Questions
❓ Why do my long-running automation scripts get killed when the SSH session ends?
When an SSH session terminates, the operating system sends a SIGHUP (hang-up) signal to all child processes associated with that session. By default, processes receiving SIGHUP will terminate unless specifically programmed to ignore it.
❓ How do `nohup`, `systemd` services, and `systemd-run` compare for keeping scripts alive?
`nohup` is simple and fast but fire-and-forget, lacking monitoring. `systemd` services are robust, idempotent, and manageable with proper logging for critical tasks. `systemd-run` is powerful for transient, on-the-fly tasks, managed by systemd without a full service file.
❓ What is a common implementation pitfall when trying to keep scripts alive?
A common pitfall is using `nohup` for critical automation tasks. While it keeps the script running, it’s a ‘fire-and-forget’ approach, meaning the calling automation tool loses state and control, making it impossible to monitor for failures or manage the script’s lifecycle effectively.