SSH logins hang after sshd restart: duplicate sshd masters / notify timeout (EL8)

Symptoms

  • New SSH connections hang immediately after the TCP connection is established. The client shows the connection open and then stalls before the server banner (e.g. with ssh -vvv, output stops right after the Local version string ... line).
  • The same hang occurs over loopback (ssh user@127.0.0.1), confirming the problem is local to the host, not the network or the client.
  • sshd logs show nothing for the hanging attempts — no accept, no auth, no PAM/2FA activity.
  • Any existing, already-authenticated SSH session continues to work normally.
  • systemctl status sshd reports the service stuck in activating (auto-restart) with Result: timeout.
  • The journal shows a repeating cycle of:
    • start operation timed out. Terminating.
    • Failed with result 'timeout'.
    • RestartSec=...s expired, scheduling restart.
    • Found left-over process <pid> (sshd) in control group while starting unit. Ignoring.
  • ps shows many /usr/sbin/sshd -D master processes with parent PID 1, accumulating over time with staggered start times.
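
The journal cycle listed above can be counted rather than eyeballed. The following is a minimal sketch, assuming the journal text is piped in on stdin; the function name count_start_timeouts is hypothetical, and the match is case-insensitive because the capitalisation of the message may differ between systemd versions.

```shell
# Counts journal lines that report a start timeout.
# count_start_timeouts is a hypothetical helper name, not a systemd tool.
count_start_timeouts() {
    # grep -c prints 0 but exits non-zero when nothing matches, so mask that status.
    grep -ci 'start operation timed out' || true
}

# Usage on the affected host:
#   journalctl -u sshd --no-pager | count_start_timeouts
```

A count that keeps growing between two runs suggests the restart loop is still active.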

Root cause

The OpenSSH package was updated, but the sshd service was never restarted afterward, so the running daemon kept executing the old binary from memory. The first restart after the update launched the new binary, which never delivered the readiness notification that systemd waits for under the unit’s Type=notify setting.

When the readiness notification does not arrive within the start timeout, systemd marks the start as failed and terminates it. Because the stock unit uses KillMode=process, only the main sshd PID is signalled; other sshd processes in the unit's control group survive and are reparented to PID 1. With Restart=on-failure, systemd then waits RestartSec and tries again. Each new attempt finds and ignores those left-over processes, so the cycle repeats.

The result is multiple orphaned sshd master processes all bound to the SSH port. Incoming connections get accepted by a wedged master that never services them, producing the “TCP connects, no banner, hangs” symptom with nothing logged.

This commonly surfaces long after the package update — the host keeps running fine on the old in-memory binary until the first service restart (manual, automated, or at reboot) exposes the problem.
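
The loop described above depends on three unit settings acting together. As a rough check, the sketch below reads Key=Value lines of the kind systemctl show prints and reports whether all three are present. The function name loop_prone is hypothetical, and the unit name sshd is assumed from the rest of this article.

```shell
# Reads "Key=Value" lines on stdin and reports whether the unit combines
# Type=notify, KillMode=process and Restart=on-failure.
# loop_prone is a hypothetical helper name, not a systemd tool.
loop_prone() {
    awk -F= '
        $1 == "Type"     { type = $2 }
        $1 == "KillMode" { killmode = $2 }
        $1 == "Restart"  { restart = $2 }
        END {
            if (type == "notify" && killmode == "process" && restart == "on-failure") {
                print "loop-prone"
            } else {
                print "not loop-prone"
                exit 1
            }
        }
    '
}

# Usage on the affected host:
#   systemctl show sshd -p Type -p KillMode -p Restart | loop_prone
```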


Resolution

Perform these steps from an existing, already-open SSH session or console that is still working. Do not close that session until SSH is confirmed healthy. The cleanup commands target only the /usr/sbin/sshd -D daemon pattern, which does not match interactive login sessions (those appear as sshd: user [priv] / sshd: user@pts/N).

1. Stop the restart loop and clear the failure state

sudo systemctl stop sshd
sudo systemctl reset-failed sshd

2. Identify the orphaned daemon masters

# Orphaned daemons to be cleared (parent PID 1):
ps -eo pid,ppid,stat,cmd | grep '[s]shd -D'

# Confirm your protected login session(s); these must NOT be killed:
ps -eo pid,ppid,stat,cmd | grep '[s]shd: '

Interactive sessions show as sshd: <user> [priv] and sshd: <user>@pts/N. They do not match the sshd -D pattern used below.
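
To make the distinction explicit before anything is killed, the ps listing can be labelled line by line. This is a minimal sketch with a hypothetical helper name, classify_sshd; it assumes the column order pid,ppid,stat,cmd used above. A healthy master also has parent PID 1, so the MASTER label identifies leftovers only once the service has been stopped as in step 1.

```shell
# Reads `ps -eo pid,ppid,stat,cmd` output on stdin and labels each sshd line.
# MASTER:  parent is PID 1 and the command starts with "/usr/sbin/sshd -D".
# SESSION: the command starts with "sshd: " (login sessions).
# classify_sshd is a hypothetical helper name.
classify_sshd() {
    awk '
        {
            # Rebuild the command column (fields 4..NF) as one string.
            cmd = $4
            for (i = 5; i <= NF; i++) cmd = cmd " " $i
        }
        $2 == 1 && index(cmd, "/usr/sbin/sshd -D") == 1 { print "MASTER  " $1; next }
        index(cmd, "sshd: ") == 1                       { print "SESSION " $1 }
    '
}

# Usage on the affected host:
#   ps -eo pid,ppid,stat,cmd | classify_sshd
```

Only PIDs labelled MASTER should be affected by the cleanup in the next step.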

3. Clear the orphaned masters

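sudo pkill -9 -f '^/usr/sbin/sshd -D'

-9 is used because the wedged masters typically do not respond to a normal TERM. pkill -f matches the pattern against each process's full command line, so the anchored pattern ^/usr/sbin/sshd -D matches only daemon masters, never login sessions. The ^ anchor also keeps pkill from matching the sudo process that launched it, whose own command line contains the pattern text.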

4. Confirm the port is free and only your session remains

ps -eo pid,ppid,stat,cmd | grep '[s]shd'   # expect only your login session(s)
sudo ss -tlnp | grep ':22'                 # expect no output

5. Validate config and start cleanly

sudo sshd -t && echo "config OK"
sudo systemctl daemon-reload
sudo systemctl reset-failed sshd
sudo systemctl start sshd
sleep 3
systemctl status sshd --no-pager | head -8
sudo ss -tlnp | grep ':22'                 # expect a single sshd PID (by default one line each for IPv4 and IPv6)

You want Active: active (running) with a single main PID that owns every listener on the port.
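
Because one sshd normally shows up on more than one ss line, counting distinct PIDs is more reliable than counting lines. The sketch below is one way to do that; listener_pids is a hypothetical helper name, and it assumes ss -tlnp style output with pid=N entries in the process column.

```shell
# Reads `ss -tlnp` output on stdin and prints the distinct PIDs that hold a
# listening socket on the given port (default 22).
# listener_pids is a hypothetical helper name.
listener_pids() {
    port=${1:-22}
    awk -v port="$port" '
        {
            # Keep only lines where some field ends in ":<port>".
            hit = 0
            for (i = 1; i <= NF; i++) if ($i ~ (":" port "$")) { hit = 1; break }
            if (!hit) next
            # Pull every pid=N out of the process column.
            line = $0
            while (match(line, /pid=[0-9]+/)) {
                print substr(line, RSTART + 4, RLENGTH - 4)
                line = substr(line, RSTART + RLENGTH)
            }
        }
    ' | sort -un
}

# Usage on the affected host:
#   sudo ss -tlnp | listener_pids 22
```

Exactly one PID in the output, equal to the Main PID shown by systemctl status sshd, is the healthy result.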

6. Verify before relying on it

ssh -vvv <user>@127.0.0.1     # loopback first

Then connect from a separate, fresh client. Only after a new login succeeds should the original safety session be closed.
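
Since the failure mode is a connection that opens but never produces a banner, a non-interactive banner probe can complement the manual login test. This is a sketch only: ssh_banner_ok is a hypothetical helper name, and it assumes bash (for its /dev/tcp redirection) and the coreutils timeout command are installed.

```shell
# Connects to host:port and succeeds only if the server sends a line that
# starts with "SSH-" within the time limit.
# ssh_banner_ok is a hypothetical helper name.
ssh_banner_ok() {
    host=$1
    port=${2:-22}
    secs=${3:-5}
    banner=$(
        timeout "$secs" bash -c '
            exec 3<>"/dev/tcp/$1/$2" || exit 1
            IFS= read -r line <&3 && printf "%s\n" "$line"
        ' _ "$host" "$port" 2>/dev/null | tr -d '\r'
    )
    case $banner in
        SSH-*) printf '%s\n' "$banner" ;;   # banner received: print it, return 0
        *)     return 1 ;;                  # refused, timed out, or not SSH
    esac
}

# Usage on the affected host:
#   ssh_banner_ok 127.0.0.1 22 5 && echo "banner OK" || echo "no banner"
```

A probe that connects but times out without a banner reproduces the original symptom; it does not replace a full login test.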


If the clean start still times out

If systemctl start sshd returns to activating (auto-restart) / Result: timeout, the new binary is genuinely not completing the Type=notify handshake. Apply a service drop-in so systemd no longer waits for a notification it will not receive:

sudo mkdir -p /etc/systemd/system/sshd.service.d
printf '[Service]\nType=simple\n' | sudo tee /etc/systemd/system/sshd.service.d/override.conf
sudo systemctl daemon-reload
sudo systemctl reset-failed sshd
sudo systemctl start sshd
sleep 3
systemctl status sshd --no-pager | head -8
sudo ss -tlnp | grep ':22'

Type=simple tells systemd to consider the service started once the process is running, rather than waiting on a readiness notification. This is a low-risk workaround, and because it lives in a drop-in under /etc/systemd/system it persists across reboots.


Prevention

  • Restart services after package updates. A patched binary does not take effect until the service is restarted. Long gaps between update and restart hide problems until the next restart or reboot.
  • Use a post-update check to find stale binaries (requires dnf-utils / yum-utils):
      sudo needs-restarting -s   # services running outdated binaries
      sudo needs-restarting -r   # whether a full reboot is advised
  • Schedule reboots for when console access is available. A reboot resolves the orphaned-process state, but if the underlying cause were a persistent config fault instead, the host could come back with no working SSH and no live session to recover from. Reboot once the failure is confirmed to be transient state, and do it with out-of-band/console access on hand.
  • Keep at least one known-good session open while troubleshooting SSH so you retain a recovery path.
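
Where needs-restarting is not installed, a single process can be checked by hand. When a running binary has been replaced on disk, Linux reports the target of /proc/<pid>/exe with a " (deleted)" suffix. The sketch below relies on that behaviour; exe_is_stale is a hypothetical helper name, and reading another user's /proc entry generally requires root.

```shell
# Return codes: 0 = binary was replaced on disk since the process started,
#               1 = binary on disk is the one being executed,
#               2 = PID not found or link not readable.
# exe_is_stale is a hypothetical helper name.
exe_is_stale() {
    target=$(readlink "/proc/$1/exe" 2>/dev/null) || return 2
    case $target in
        *" (deleted)") return 0 ;;
        *)             return 1 ;;
    esac
}

# Usage on a host, as root, for each running sshd master:
#   for pid in $(pgrep -f '^/usr/sbin/sshd -D'); do
#       exe_is_stale "$pid" && echo "PID $pid is running a replaced binary"
#   done
```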

Quick reference

Step                                      Command
Stop loop                                 sudo systemctl stop sshd
Clear failure state                       sudo systemctl reset-failed sshd
Find orphans                              ps -eo pid,ppid,stat,cmd | grep '[s]shd -D'
Clear orphans                             sudo pkill -9 -f '^/usr/sbin/sshd -D'
Confirm port free                         sudo ss -tlnp | grep ':22'
Validate config                           sudo sshd -t
Start                                     sudo systemctl start sshd
Workaround (if notify still times out)    drop-in Type=simple