🚀 Executive Summary

TL;DR: CI/CD pipelines frequently fail with SSH ‘Permission denied’ errors because runners lack proper SSH key configuration and permissions in their ephemeral environments. This guide offers solutions ranging from temporary fixes to robust secret management and identity-based authentication to resolve these issues permanently.

🎯 Key Takeaways

  • CI/CD runners are ephemeral, non-interactive environments that require explicit SSH key configuration, including correct file permissions (e.g., `chmod 600`) and proper user context.
  • The recommended and most secure solution for managing SSH keys in CI/CD is to leverage the platform’s built-in secret management (e.g., GitLab CI/CD variables) to inject and load keys via `ssh-agent`.
  • For modern cloud or large-scale environments, adopting identity-based authentication (like IAM Roles, Service Accounts, or HashiCorp Vault) eliminates static SSH key management, offering enhanced security and scalability.


Tired of your CI/CD pipeline failing with SSH ‘Permission denied’ errors? A senior engineer breaks down the root cause and provides three practical solutions, from a quick fix to a permanent architecture change.

SSH “Permission Denied” in CI/CD? A Senior Engineer’s Guide to Fixing It For Good

I remember it like it was yesterday. It was 2 AM, the PagerDuty alarm was screaming, and a critical hotfix deployment was failing. The error? “Permission denied (publickey).” A simple, infuriating error. After an hour of frantic debugging, we found the culprit: a well-meaning engineer had manually “fixed” the SSH keys on our Jenkins agent, `build-agent-02`, earlier that day, but a container restart had wiped it all away. We were trying to deploy to a server from an environment that had forgotten its own keys. It’s a rite of passage, I guess, but one that I want to help you skip.

The Real “Why”: Your Runner Isn’t You

When you see that “Permission denied” error in a pipeline log, the first instinct is to say, “But it works on my machine!” And you’re right, it does. The problem is that the CI/CD runner is not your machine. It’s a separate, non-interactive environment running commands as a specific service user (like gitlab-runner or jenkins).

This environment is often sterile and ephemeral. It doesn’t have your personal ~/.ssh/config, it doesn’t have your keys loaded into an agent, and any key file you write there gets default permissions (typically 0644 under the common umask of 022), which SSH will reject. The SSH client is deliberately strict for security reasons, and in a pipeline the connection usually fails for one of three reasons:

  • The Wrong Key: The runner doesn’t have the private key needed to authenticate with the target server (e.g., prod-db-01), or the target account’s ~/.ssh/authorized_keys on that server doesn’t contain the matching public key.
  • The Wrong Permissions: Even if the key exists, its file permissions are too open. OpenSSH refuses to use a private key that group or other users can access (it warns “UNPROTECTED PRIVATE KEY FILE!” and skips the key), so the file must be owner-only (chmod 600).
  • The Wrong User: The pipeline is running as the gitlab-runner user, but the key is located in /root/.ssh/ or another directory that user can’t read.
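The three checks above are easy to script as a preflight step before the first ssh call. Here’s a minimal bash sketch, assuming a Linux runner with GNU coreutils `stat`; the function name `check_ssh_key` is hypothetical. It applies roughly the rule OpenSSH uses for permissions (no group or other bits at all). It can’t see the server side (authorized_keys); for that, the output of `ssh -v` is the place to look.

```shell
#!/usr/bin/env bash
# Hypothetical preflight helper: check_ssh_key PATH
# Covers the runner-side failure modes: missing key, wrong user, open permissions.
# Assumes GNU coreutils `stat` (standard on Linux runners).
check_ssh_key() {
  local key="$1"
  # Wrong key: the file isn't there for the user this job runs as.
  if [ ! -e "$key" ]; then
    echo "missing: $key (job is running as $(id -un))"
    return 1
  fi
  # Wrong user: the file exists but this user can't read it.
  if [ ! -r "$key" ]; then
    echo "unreadable by $(id -un): $key"
    return 1
  fi
  # Wrong permissions: any group/other bit (mask 077) makes OpenSSH skip the key.
  local mode
  mode=$(stat -c '%a' "$key")
  if (( 8#$mode & 8#077 )); then
    echo "too open: $key has mode $mode (fix with: chmod 600 $key)"
    return 1
  fi
  echo "ok: $key (mode $mode, owner $(stat -c '%U' "$key"))"
}
```

Calling `check_ssh_key ~/.ssh/id_rsa_deploy || exit 1` at the top of a deploy script turns a vague “Permission denied” into a message that names the actual problem.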

The Fixes: From Duct Tape to a New Engine

Alright, let’s get you unstuck. I’m going to walk you through three ways to solve this, ranging from the “I need this working by lunch” approach to the “let’s never have this problem again” architecture.

Solution 1: The Quick Fix (The “Sudo and Chmod” Shuffle)

This is the down-and-dirty method. You SSH into the runner itself (if you even can) and manually configure the key. It’s fast, it’s direct, but it’s incredibly brittle. If your runner is an ephemeral Docker container, this fix will last exactly as long as that container, and a key you drop on the host isn’t visible inside job containers at all unless it’s mounted in.

Steps:

  1. Place the private key content into a file owned by the user the jobs run as, for example, /home/gitlab-runner/.ssh/id_rsa_deploy.
  2. Immediately fix the permissions. This is the step everyone forgets.
     chmod 600 /home/gitlab-runner/.ssh/id_rsa_deploy
  3. Tell SSH to use this specific key for the connection.
     ssh -i /home/gitlab-runner/.ssh/id_rsa_deploy user@prod-web-cluster 'uptime'
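If you do take the manual route, a small bash sketch like this at least gets the permissions right from the first byte instead of fixing them afterwards. The helper name `install_deploy_key` is hypothetical, and the paths are just the examples from the steps above.

```shell
#!/usr/bin/env bash
# Hypothetical helper for the manual fix: install_deploy_key SOURCE DEST
# Copies key material from SOURCE into DEST with owner-only permissions.
install_deploy_key() {
  local src="$1" dest="$2"
  local dir
  dir=$(dirname "$dest")
  # The ~/.ssh directory itself should be owner-only as well.
  mkdir -p "$dir"
  chmod 700 "$dir"
  # umask 077 inside a subshell creates a new file as 0600, so the key is
  # never briefly readable by other users between the write and a chmod.
  ( umask 077 && cat "$src" > "$dest" )
  # If DEST already existed, the redirect kept its old mode; force 0600.
  chmod 600 "$dest"
}
```

Run it as the job’s user (for example from a `sudo -u gitlab-runner -i` shell) so the file ends up owned by that user, then connect with `ssh -i /home/gitlab-runner/.ssh/id_rsa_deploy -o IdentitiesOnly=yes user@prod-web-cluster 'uptime'`. `IdentitiesOnly=yes` makes ssh offer only that key, which avoids “Too many authentication failures” when an agent is holding several keys.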

Warning: I call this the “Friday Afternoon Fix.” It feels good to get the green checkmark, but this manual setup is technical debt. When that runner gets replaced or rebuilt, your deployment will break again, and no one will remember why.

Solution 2: The Permanent Fix (Using Your CI/CD’s Secret Management)

This is the way. Your CI/CD platform (GitHub Actions, GitLab CI, Jenkins, etc.) has a built-in, secure way to handle secrets like SSH keys. You store the key as a variable and inject it into the pipeline at runtime. This is repeatable, secure, and infrastructure-agnostic.

Steps (GitLab CI/CD Example):

  1. Go to your project’s Settings > CI/CD > Variables.
  2. Create a new variable. Let’s call it SSH_PRIVATE_KEY.
  3. Paste the entire content of your private key into the value field. Make sure you copy everything, including the -----BEGIN... and -----END... lines.
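  4. Mark the variable as ‘Protected’ so it is only exposed to pipelines running on protected branches or tags. Also mark it ‘Masked’ to hide it from job logs where GitLab allows it: masking only works for single-line values, so a raw multi-line key usually can’t be masked. A common workaround is to store a single-line, base64-encoded copy and decode it with base64 -d in the job.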
  5. In your .gitlab-ci.yml file, use a before_script to load the key.
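```yaml
deploy_job:
  image: alpine:latest
  before_script:
    # Install the OpenSSH tools if the image doesn't have them.
    - 'which ssh-agent || ( apk add --update openssh )'
    # Start an agent; eval exports SSH_AUTH_SOCK for the rest of the job.
    - eval $(ssh-agent -s)
    # Load the key from the CI/CD variable, stripping Windows CRs first.
    - echo "$SSH_PRIVATE_KEY" | tr -d '\r' | ssh-add -
    - mkdir -p ~/.ssh
    - chmod 700 ~/.ssh
    # Trust the server's host key. SSH_KNOWN_HOSTS is an assumed second
    # variable holding a verified `ssh-keyscan prod-web-cluster` line.
    - echo "$SSH_KNOWN_HOSTS" >> ~/.ssh/known_hosts
    - chmod 644 ~/.ssh/known_hosts
  script:
    - ssh user@prod-web-cluster 'whoami'
```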

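The before_script makes sure ssh-agent is installed and running and loads your key from the variable into it, so every later ssh command in the job can authenticate without specifying an identity file. One thing the agent doesn’t cover is the server’s host key: a fresh container starts with an empty ~/.ssh/known_hosts, and a non-interactive ssh then stops with “Host key verification failed.” Populate known_hosts from a trusted source (for example, a second CI/CD variable holding ssh-keyscan output that you’ve checked against the server’s real fingerprint) rather than disabling StrictHostKeyChecking.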
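A pasted key that lost its first or last line, or picked up Windows line endings, tends to surface as a vague ssh-add error. Here’s a bash sketch of a preflight check, assuming the same SSH_PRIVATE_KEY variable; `check_key_text` is a hypothetical helper, and it only verifies the BEGIN/END framing, not that the key itself is valid.

```shell
#!/usr/bin/env bash
# Hypothetical preflight: check_key_text reads key text on stdin, strips CRs
# (the same job `tr -d '\r'` does in the pipeline), and checks that the
# BEGIN/END framing lines survived the copy-paste. It does NOT prove the key
# is valid; ssh-add still does that.
check_key_text() {
  local text
  # Command substitution also drops trailing newlines, so the END line is last.
  text=$(tr -d '\r')
  case "$text" in
    "-----BEGIN "*"PRIVATE KEY-----"*"-----END "*"PRIVATE KEY-----")
      # Framing looks intact: pass the cleaned key through unchanged.
      printf '%s\n' "$text"
      ;;
    *)
      echo "SSH_PRIVATE_KEY is missing its BEGIN/END lines; re-paste the whole key" >&2
      return 1
      ;;
  esac
}
```

In a job you could define it in before_script and run `echo "$SSH_PRIVATE_KEY" | check_key_text | ssh-add -`. If the check fails, ssh-add receives no input and should fail too, with the clearer message just above it in the log.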

Solution 3: The ‘Nuclear’ Option (Rethinking Authentication)

Sometimes, the problem isn’t the key; it’s the fact that you’re using keys at all. In a modern cloud environment, managing individual keys for dozens of services and servers is a nightmare. This is where I put on my Cloud Architect hat.

This option involves moving away from static SSH keys for machine-to-machine communication and using a more dynamic, identity-based approach.

  • For Cloud VMs (AWS, GCP, Azure): Use IAM Roles (AWS), service accounts (GCP), or Managed Identities (Azure). Your CI runner instance is granted a role that gives it permission to access other resources (like an S3 bucket for deployment artifacts, or the permission to run commands via AWS SSM). There are no long-lived keys to manage: the platform issues short-lived credentials to the instance based on its identity and rotates them automatically.
  • For Kubernetes: Use Service Accounts. Your deployment pod is given a service account with specific RBAC permissions to interact with the Kubernetes API or other services within the cluster.
  • For On-Prem/Hybrid: Use a Secrets Broker like HashiCorp Vault. Your CI/CD runner authenticates with Vault using a trusted identity (like a token, an AppRole, or a JWT/OIDC token that the CI platform issues to the job), and Vault generates a short-lived credential (like a signed SSH certificate or a temporary database password) just for that pipeline run. The credential expires automatically.
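To make the Vault option concrete, here’s a rough bash sketch of the short-lived SSH certificate flow. It assumes the runner has already logged in to Vault (VAULT_ADDR and VAULT_TOKEN are set) and that Vault’s SSH secrets engine is mounted at `ssh-client-signer` with a signing role we’re calling `ci-deploy`; both names are placeholders, and `get_ssh_cert` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: get_ssh_cert DIR
# Assumes VAULT_ADDR/VAULT_TOKEN come from an earlier login step and that an
# SSH secrets engine is mounted at "ssh-client-signer" with a role named
# "ci-deploy" (placeholder names; substitute your own).
get_ssh_cert() {
  local dir="$1"
  local key="$dir/id_ed25519"
  mkdir -p "$dir"
  chmod 700 "$dir"
  # Start clean so ssh-keygen never stops to ask about overwriting.
  rm -f "$key" "$key.pub" "$key-cert.pub"
  # Throwaway key pair for this job only: no passphrase (-N '') because
  # nothing can type one in a pipeline, and -q to keep the log quiet.
  ssh-keygen -q -t ed25519 -N '' -f "$key" || return 1
  chmod 600 "$key"
  # Ask Vault to sign the public key. The certificate's lifetime comes from
  # the role's TTL, so it expires on its own after the job.
  vault write -field=signed_key "ssh-client-signer/sign/ci-deploy" \
    public_key=@"$key.pub" > "$key-cert.pub" || return 1
  # Print the identity file path for the caller to pass to ssh -i.
  echo "$key"
}
```

Usage would look like `key=$(get_ssh_cert "$HOME/.ssh/ci")` followed by `ssh -i "$key" user@prod-web-cluster 'uptime'`. OpenSSH automatically looks for a certificate named after the identity file with `-cert.pub` appended, so no extra flag is needed. The target servers still have to trust Vault’s CA (the `TrustedUserCAKeys` setting in sshd_config), which is a one-time server-side change.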

Pro Tip: This is a significant architectural change, not a quick fix. But if you’re constantly fighting with key management, rotation policies, and “Permission denied” errors, it’s time to have a serious conversation about adopting an identity-based, keyless workflow. It will pay for itself in saved time and fewer 2 AM wake-up calls.

Which Path Should You Choose?

Here’s my breakdown to help you decide.

| Solution | Pros | Cons | My Advice |
|---|---|---|---|
| 1. The Quick Fix | Fastest to implement for a single, stable runner. | Extremely brittle; not secure; breaks with ephemeral runners. | Use it to understand the problem, then immediately move to Solution 2. |
| 2. The Permanent Fix | Secure; repeatable; standard industry practice. | Requires understanding your CI/CD platform’s secrets system. | This should be your default for 90% of use cases. Do this. |
| 3. The ‘Nuclear’ Option | Most secure; eliminates an entire class of problems; highly scalable. | Complex to set up; requires major architectural changes. | Adopt this when you’re building a new platform or when key management itself becomes a full-time job. |

Don’t feel bad about hitting this wall. We’ve all been there. The key is to understand why it’s failing and to choose a fix that doesn’t just solve the problem today but prevents it from happening again tomorrow. Now, go get that pipeline green.

Darian Vance - Lead Cloud Architect & DevOps Strategist

With over 12 years in system architecture and automation, Darian specializes in simplifying complex cloud infrastructures. An advocate for open-source solutions, he founded TechResolve to provide engineers with actionable, battle-tested troubleshooting guides and robust software alternatives.


🤖 Frequently Asked Questions

❓ Why do I get ‘Permission denied (publickey)’ in my CI/CD pipeline when SSH works locally?

CI/CD runners are sterile, ephemeral environments that don’t have your personal SSH configurations, keys loaded into an agent, or correct file permissions. They run as a specific service user and require the private key to be explicitly provided with `chmod 600` permissions.

❓ How do the different SSH key management solutions for CI/CD compare in terms of security and scalability?

The ‘Quick Fix’ is brittle and insecure, only suitable for immediate debugging. Using CI/CD secret management is secure, repeatable, and standard practice for most use cases. The ‘Nuclear Option’ (identity-based authentication like IAM Roles or Vault) is the most secure and scalable, eliminating static key management but requires significant architectural changes.

❓ What is a common implementation pitfall when setting up SSH keys in CI/CD, even after placing the key?

A common pitfall is forgetting to set the correct file permissions for the private key. OpenSSH refuses to use a private key file that other users can access, so apply `chmod 600` to the key file; otherwise the key is ignored and the connection fails with ‘Permission denied’.
