@deansheather
Member

Prevents thundering herd issues with SSH connections by adding health tracking, exponential backoff, and singleflighting to the connection pool.

Changes

  • SSHConnectionPool class with:

    • Health tracking (healthy/unhealthy/unknown states)
    • Exponential backoff: 1s → 5s → 10s → 20s → 40s → 60s (cap)
    • Singleflighting: concurrent probes to same host share one attempt
    • Fast-path for known-healthy connections (no re-probe)
  • Integration points:

    • SSHRuntime.exec() and execSSHCommand() call acquireConnection()
    • PTYService calls acquireConnection() before spawning SSH terminals
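The backoff steps listed above do not follow a pure doubling rule (1s jumps to 5s), so one way to express them is a lookup table. This is a minimal sketch under that assumption; `BACKOFF_SCHEDULE_SECONDS` and `backoffDelayMs` are hypothetical names, not taken from the PR.

```typescript
// Schedule copied from the PR description: 1s, 5s, 10s, 20s, 40s, then capped at 60s.
// The constant and function names are illustrative assumptions.
const BACKOFF_SCHEDULE_SECONDS: readonly number[] = [1, 5, 10, 20, 40, 60];

/**
 * Returns the backoff delay in milliseconds after `consecutiveFailures` failed probes.
 * Failure counts past the end of the schedule stay at the final (cap) value.
 */
function backoffDelayMs(consecutiveFailures: number): number {
  if (consecutiveFailures <= 0) return 0; // no failures yet, so no backoff
  const index = Math.min(consecutiveFailures, BACKOFF_SCHEDULE_SECONDS.length) - 1;
  const seconds = BACKOFF_SCHEDULE_SECONDS[index] ?? 60; // fallback equals the cap
  return seconds * 1000;
}
```

For example, the first failure would map to 1000 ms and the sixth or any later failure to 60000 ms.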

Flow

acquireConnection() → in backoff? → throw immediately
                   → known healthy? → return immediately  
                   → inflight probe? → wait on existing promise
                   → start probe → success? → mark healthy, return
                                 → failure? → mark failed + backoff, throw
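The flow above could be sketched roughly as follows. This is an illustration, not the PR's implementation: `ConnectionPoolSketch`, the string host key, and the injected `probe` and `now` functions are assumptions made so the example is self-contained. Only `acquireConnection` and `reportFailure` are names that appear in this conversation.

```typescript
type Health = "healthy" | "unhealthy" | "unknown";

interface HostState {
  health: Health;
  consecutiveFailures: number;
  backoffUntil: number; // epoch ms; 0 means "not in backoff"
}

// Backoff steps from the PR description, in milliseconds.
const SCHEDULE_MS: readonly number[] = [1000, 5000, 10000, 20000, 40000, 60000];

// Hypothetical class; the real SSHConnectionPool may be structured differently.
class ConnectionPoolSketch {
  private readonly states = new Map<string, HostState>();
  private readonly inflight = new Map<string, Promise<void>>();
  private readonly probe: (host: string) => Promise<void>;
  private readonly now: () => number;

  constructor(probe: (host: string) => Promise<void>, now: () => number = Date.now) {
    this.probe = probe;
    this.now = now;
  }

  private stateFor(host: string): HostState {
    let state = this.states.get(host);
    if (!state) {
      state = { health: "unknown", consecutiveFailures: 0, backoffUntil: 0 };
      this.states.set(host, state);
    }
    return state;
  }

  async acquireConnection(host: string): Promise<void> {
    const state = this.stateFor(host);
    // In backoff: fail fast without touching the network.
    if (this.now() < state.backoffUntil) {
      throw new Error(`SSH host ${host} is in backoff`);
    }
    // Known healthy: skip the probe.
    if (state.health === "healthy") return;
    // A probe is already running: share its result (singleflight).
    const existing = this.inflight.get(host);
    if (existing) return existing;
    // Otherwise start one probe and record it so concurrent callers can join.
    const attempt = this.probe(host)
      .then(
        () => {
          state.health = "healthy";
          state.consecutiveFailures = 0;
          state.backoffUntil = 0;
        },
        (err: unknown) => {
          this.reportFailure(host);
          throw err;
        },
      )
      .finally(() => {
        this.inflight.delete(host);
      });
    this.inflight.set(host, attempt);
    return attempt;
  }

  // Marks the host unhealthy and advances it one step along the backoff schedule.
  reportFailure(host: string): void {
    const state = this.stateFor(host);
    state.health = "unhealthy";
    state.consecutiveFailures += 1;
    const index = Math.min(state.consecutiveFailures, SCHEDULE_MS.length) - 1;
    state.backoffUntil = this.now() + (SCHEDULE_MS[index] ?? 60000);
  }
}
```

Injecting the clock and the probe keeps the sketch testable without a real SSH host. Because the method body contains no `await` before the in-flight promise is stored, two calls made back to back for the same host share a single probe.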

Generated with mux

@deansheather force-pushed the ssh-mux-connection-backoff branch from cfa5fa8 to f611745 on December 5, 2025 03:19

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Prevents thundering herd issues with SSH connections by:

- Adding SSHConnectionPool class with health tracking
- Implementing exponential backoff (1s → 5s → 10s → 20s → 40s → 60s cap)
- Singleflighting concurrent connection attempts to same host
- Probing unknown connections before first use
- Skipping probes for known-healthy connections

Integration points:
- SSHRuntime.exec() and execSSHCommand() call acquireConnection()
- PTYService calls acquireConnection() before spawning SSH terminals

_Generated with mux_
@deansheather force-pushed the ssh-mux-connection-backoff branch from f611745 to 53de00a on December 5, 2025 03:23
@deansheather
Member Author

@codex review

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Comment on lines 213 to 223

    } else if (runtime instanceof SSHRuntime) {
      // SSH: Use node-pty to spawn SSH with local PTY (enables resize support)
      const sshConfig = runtime.getConfig();

      // Ensure connection is healthy before spawning terminal
      // This provides backoff protection and singleflighting for concurrent requests
      await sshConnectionPool.acquireConnection(sshConfig);

      const sshArgs = buildSSHArgs(sshConfig, workspacePath);

      log.info(`[PTY] SSH terminal for ${sessionId}: ssh ${sshArgs.join(" ")}`);

[P1] Terminal failures never update SSH pool health

This block now marks the SSH target as healthy via sshConnectionPool.acquireConnection before spawning a terminal, but the PTY path never reports subsequent SSH failures back to the pool. If a host was previously healthy and later goes down, acquireConnection will keep fast-pathing because the cached status remains healthy, and the on-exit handler does not call reportFailure when ssh exits with code 255 or fails to spawn. As a result, terminal creation will loop without backoff or reprobe, defeating the new thundering-herd protection for PTY sessions once a healthy host becomes unreachable.
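One possible shape for the missing feedback path is a small helper that the terminal's on-exit handler calls with the ssh exit code. This is only a sketch: `FailureReporter` and `reportSshExit` are hypothetical names, and the only pool method assumed is the `reportFailure` mentioned in this review.

```typescript
// Minimal view of the pool that this sketch needs (hypothetical interface).
interface FailureReporter {
  reportFailure(host: string): void;
}

// ssh exits with 255 when ssh itself hit an error (for example, the host is unreachable);
// other codes are the remote command's own exit status.
const SSH_ERROR_EXIT_CODE = 255;

/**
 * Intended to be called from the terminal's on-exit handler.
 * Returns true when the exit was reported to the pool as a connection failure.
 */
function reportSshExit(
  pool: FailureReporter,
  host: string,
  exitCode: number | undefined,
): boolean {
  if (exitCode !== SSH_ERROR_EXIT_CODE) return false;
  pool.reportFailure(host); // clears the cached healthy state and starts backoff
  return true;
}
```

A remote command can itself exit with 255, so this check may occasionally over-report; a fuller fix would probably also report failures to spawn ssh at all, which this sketch leaves out.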
