🚀 Executive Summary

TL;DR: OT devices can look bricked after a reboot because critical services depend on files and directories in `tmpfs`, which starts out empty on every boot. The permanent solution is to use `systemd-tmpfiles` to declaratively recreate these essential directories with the correct ownership and permissions on every boot.

🎯 Key Takeaways

  • Critical services on Linux systems, particularly OT devices, can fail to start after a reboot or power cycle if they rely on files or directories located in `tmpfs` (a RAM-based filesystem, typically mounted at `/run`, which `/var/run` usually links to).
  • The `systemd-tmpfiles` mechanism provides a permanent and idempotent solution to ensure required directories are automatically created with correct permissions on every system boot, preventing service startup failures.
  • In emergency situations, a manual override via console access (recreating directories and setting ownership) can quickly restore service, but it’s a temporary fix that doesn’t persist across reboots.

Have you tried turning it off and on again? On bricking OT devices (part 2)

When “turning it off and on again” bricks your OT devices or servers, it’s often due to services depending on temporary files in `tmpfs` that vanish on reboot. Learn the real cause and discover three practical fixes, from a quick manual intervention to a permanent systemd solution.

That Time a Reboot Bricked a Server, and What We Learned About OT Devices

I still remember the Slack message that lit up my phone at 2 AM. It was from Alex, one of our sharp junior engineers. “Darian, `ot-data-collector-03` is down. I tried a reboot from the console, and now it won’t come back. At all. I think I bricked it.” The feeling in the pit of my stomach was all too familiar. We’ve all been there: a simple, routine “turn it off and on again” that goes catastrophically wrong, turning a minor hiccup into a full-blown outage. That night, the fix was simple, but the lesson was critical, especially for anyone dealing with Operational Technology (OT) or any locked-down appliance-like systems.

The “Why”: The Ghost in the RAM

So, what actually happened? Why did the most trusted tool in the IT arsenal fail so spectacularly? The culprit wasn’t a hardware failure or a corrupted disk. The problem was `tmpfs`.

Many modern Linux systems, for performance and to reduce wear on flash storage, mount /run as a `tmpfs` filesystem, and /var/run and /var/lock are usually just symlinks to /run and /run/lock. This means the directory exists entirely in RAM. It’s blazing fast, but it has one massive drawback: it’s completely ephemeral. Every boot starts with it empty, whether you did a graceful reboot or a hard power-off-power-on. The trap is that a subdirectory created once, by hand or by an install script, keeps working for months of uptime, so nobody notices that nothing recreates it at boot. If a critical service, like Docker or a custom data collector, needs a specific subdirectory or a socket file in there to start, it will fail on the next boot because its home is gone.

The system boots up, tries to start `your-critical-app.service`, and the service screams, “I can’t find `/var/run/my-app/`!” and promptly falls over. The machine is running, but from the outside, it’s a brick.
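To check whether a given path is actually backed by `tmpfs` on a particular device, a quick sketch like the following can help. It assumes GNU coreutils `stat`; the helper names `fs_type` and `is_ephemeral` are made up for this example, and the list of paths is only a starting point.

```shell
#!/bin/sh
# Sketch: report which paths are backed by tmpfs (assumes GNU coreutils stat).

# Print the filesystem type behind a path; symlinks such as /var/run are followed.
fs_type() {
    stat -f -c %T "$1" 2>/dev/null
}

# Exit 0 if the path lives on tmpfs, i.e. its contents are gone after any reboot.
is_ephemeral() {
    [ "$(fs_type "$1")" = "tmpfs" ]
}

for p in /run /var/run /var/lock /tmp; do
    [ -e "$p" ] || continue
    if is_ephemeral "$p"; then
        echo "$p: tmpfs (wiped on every boot)"
    else
        echo "$p: $(fs_type "$p") (persistent)"
    fi
done
```

Anything your service needs under a path reported as `tmpfs` has to be recreated at boot by something, or it will be missing after the next restart.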

The Fixes: From Duct Tape to Reinforcement

Alright, enough theory. You’re in the hot seat, a critical device is down, and people are waiting. Here’s how you fix it, starting with the fastest (and dirtiest) method and moving to the one that’ll let you sleep at night.

1. The Quick Fix: “The Manual Override”

This is the “get it working NOW” approach. It’s hacky, it won’t survive the next reboot, but it will stop the bleeding. You’ll need console access to the machine (like the vSphere console or a direct connection).

First, figure out what directory is missing by checking the service’s status or logs. You’ll likely see an error message.

systemctl status docker.service
# ● docker.service - Docker Application Container Engine
# ...
# FAILED: failed to start daemon: pid file found, ensure docker is not running or delete /var/run/docker.pid

In this case, the service is complaining about a PID file, but often the issue is that the directory itself doesn’t exist. Let’s assume our `ot-data-collector.service` needs `/var/run/ot-collector/`.
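If the status output is noisy, one way to narrow it down is to filter the unit’s log for “No such file or directory” errors under /run or /var/run. The sketch below assumes the service reports the missing path in that form, which not every application does; `missing_run_paths` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: pull runtime paths out of "No such file or directory" log lines.
# Reads log text on stdin, prints each distinct /run or /var/run path once.
missing_run_paths() {
    grep -i 'no such file or directory' \
        | grep -oE '/(var/)?run/[A-Za-z0-9._/-]+' \
        | sort -u
}

# Typical use on the affected host (current boot only):
#   journalctl -u ot-data-collector.service -b --no-pager | missing_run_paths
```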

You just manually recreate what the service expects:

# Log in via the emergency console (as root)
# -p avoids an error if the directory already exists
mkdir -p /var/run/ot-collector
chown ot-user:ot-group /var/run/ot-collector
chmod 0755 /var/run/ot-collector
systemctl start ot-data-collector.service

Boom. The service should start, and you’re back online. Now, go file a ticket to implement the permanent fix before this happens again next month.

2. The Permanent Fix: “The `systemd-tmpfiles` Way”

This is the correct way to solve the problem. Instead of fixing it after it breaks, we’re going to tell the system how to prepare the environment on every single boot. We’ll use `systemd-tmpfiles` to declaratively create the directories we need.

You create a simple configuration file in /etc/tmpfiles.d/. Let’s call it ot-collector.conf.

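# /etc/tmpfiles.d/ot-collector.conf
# /var/run is normally a symlink to /run, so the entry uses /run directly
# (systemd logs a warning about legacy /var/run paths in tmpfiles.d files).

# Type  Path                Mode    UID       GID       Age   Argument
d       /run/ot-collector   0755    ot-user   ot-group  -     -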

Let’s break that down:

  • d: Create a directory if it doesn’t exist.
  • Path: The full path to the directory.
  • Mode: The permissions (like `chmod`).
  • UID/GID: The user and group that should own it.
  • Age/Argument: We can ignore these for this use case.

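Once that file is in place, `systemd-tmpfiles-setup.service` processes it early in every boot, so your directory is ready and waiting *before* your service tries to start. It’s idempotent, too: if the directory already exists, its contents are kept and only the mode and ownership are adjusted. This is automated and the way it should have been configured in the first place.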
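You don’t have to reboot to find out whether the file works. `systemd-tmpfiles --create` applies a config file immediately, which makes for a low-risk test. A minimal sketch, assuming it runs as root on a systemd-based host; `apply_tmpfiles` is a hypothetical wrapper, not part of systemd.

```shell
#!/bin/sh
# Sketch: apply a tmpfiles.d config immediately instead of waiting for a reboot.
# $1 = path to the config file. Needs root to create directories under /run.
apply_tmpfiles() {
    if ! command -v systemd-tmpfiles >/dev/null 2>&1; then
        echo "systemd-tmpfiles not found; is this a systemd-based system?" >&2
        return 127
    fi
    systemd-tmpfiles --create "$1"
}

# Typical use, followed by a check of the resulting mode and ownership:
#   apply_tmpfiles /etc/tmpfiles.d/ot-collector.conf
#   stat -c '%a %U:%G %n' /run/ot-collector
```

If the `stat` output shows the mode and owner you put in the config, the next reboot should produce the same result.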

A Word of Warning: Resist the urge to just `chmod 777` the directory. Figure out the correct user the service runs as (`ot-user` in our example) and give it the minimal permissions it needs. Sloppy permissions are how security incidents begin.
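A quick way to catch that mistake after the fact is to check whether a runtime directory is world-writable. A small sketch; `is_world_writable` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: flag a directory whose mode grants write access to "other" (e.g. 0777).
# Exit 0 if $1 is world-writable, non-zero otherwise.
is_world_writable() {
    [ -n "$(find "$1" -maxdepth 0 -perm -0002 2>/dev/null)" ]
}

# Typical use on the example directory from this article:
#   is_world_writable /var/run/ot-collector && echo "too permissive, tighten it"
```

To see which account a unit is configured to run as, `systemctl show -p User -p Group ot-data-collector.service` prints the values from the unit file; empty values generally mean the service runs as root.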

3. The ‘Nuclear’ Option: “Pave and Redeploy”

Sometimes, you can’t get console access. Maybe it’s a remote OT device in a factory with no remote hands available, or a VM so borked that the emergency console won’t even respond. If you can’t get a shell, you can’t fix it manually.

This is where your disaster recovery and CI/CD practices save you. The fix is to destroy the bricked instance and redeploy it from your golden image or Infrastructure as Code definition (Terraform, Ansible, etc.).

This option highlights a critical point: treat your servers like cattle, not pets. If `ot-data-collector-03` dies, you shouldn’t be mourning its loss. You should be able to spin up `ot-data-collector-04` from a known-good configuration in minutes. Of course, this only works if your “known-good” configuration includes the permanent fix we just discussed!
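One way to make sure of that is to have the image build or provisioning script write the drop-in itself. A sketch, assuming a shell-based provisioning step and the example user, group, and directory names from this article; `install_tmpfiles_conf`, its root-prefix argument, and the `/mnt/image` mount point are hypothetical.

```shell
#!/bin/sh
# Sketch: write the tmpfiles.d drop-in into an image root during provisioning.
# $1 = root of the target filesystem ("" or "/" for the running system).
install_tmpfiles_conf() {
    root="${1%/}"
    mkdir -p "$root/etc/tmpfiles.d" || return 1
    cat > "$root/etc/tmpfiles.d/ot-collector.conf" <<'EOF'
# Recreate the collector's runtime directory on every boot.
# /var/run is normally a symlink to /run, so the entry uses /run directly.
d /run/ot-collector 0755 ot-user ot-group - -
EOF
}

# Typical use inside an image build, with the image mounted at /mnt/image:
#   install_tmpfiles_conf /mnt/image
```

The same file can just as well be shipped by whatever tool you already use for configuration; the point is that it lives in the definition of the machine, not only on the machine.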

Which Fix Should You Choose?

Here’s a quick cheat sheet for when to use each approach.

| Solution | When to Use It | Pros | Cons |
| --- | --- | --- | --- |
| Manual Override | During an active outage when speed is everything. | Fastest way to restore service. | Not permanent; problem will reoccur. |
| `systemd-tmpfiles` | The default, correct solution for all new and existing systems. | Permanent, idempotent, best practice. | Requires root access and a configuration change. |
| Pave and Redeploy | When you have no shell access or the system is deeply compromised. | Tests your DR/automation; results in a clean state. | Can be slower; requires a mature IaC setup. |

That 2 AM incident with Alex ended with the “Manual Override” to get the system back, followed by a post-mortem where we rolled out the `systemd-tmpfiles` fix across the entire fleet the next day. A painful night, but it led to a more resilient system. So next time a reboot goes wrong, don’t panic. Just remember the ghosts in the RAM.

Darian Vance - Lead Cloud Architect

Lead Cloud Architect & DevOps Strategist

With over 12 years in system architecture and automation, Darian specializes in simplifying complex cloud infrastructures. An advocate for open-source solutions, he founded TechResolve to provide engineers with actionable, battle-tested troubleshooting guides and robust software alternatives.


🤖 Frequently Asked Questions

❓ Why do OT devices sometimes fail to restart after a reboot, even if the hardware is fine?

This often occurs because critical services depend on temporary files or directories (e.g., in `/var/run`) that are stored in `tmpfs` (RAM). Those contents are lost on every reboot or power cycle, so unless something recreates them at boot, services can’t find their required environment when they start.

❓ What is the recommended long-term solution to prevent services from failing due to missing `tmpfs` directories on reboot?

The recommended long-term solution is to use `systemd-tmpfiles`. By creating a configuration file in `/etc/tmpfiles.d/` (e.g., `ot-collector.conf`), you can declaratively define the directories, permissions, and ownership that `systemd` will automatically create on every boot.

❓ What is a common security pitfall when manually fixing missing directories for services, and how can it be avoided?

A common pitfall is using overly permissive permissions like `chmod 777`. To avoid this, identify the specific user and group (`UID/GID`) the service runs as and apply only the minimal necessary permissions (e.g., `0755`) to the recreated directory.
