Skip to content

Control-agent blocks on synchronous worker spawns, causing complete unresponsiveness #185

@vltbaudbot

Description

@vltbaudbot

Summary

The control-agent can become completely unresponsive for extended periods (20+ minutes observed) when worker spawn commands block the main event loop. This creates a critical outage where all incoming messages are queued but never processed.

Problem Statement

Symptom: Control-agent stops processing inbox messages while bridge continues to receive and forward them.

Root Cause: The control-agent executes pi session spawn commands synchronously via the bash tool. If the spawn process hangs or takes a long time, the control-agent cannot process any other messages until it completes.

Impact:

  • Complete loss of responsiveness to incoming requests
  • Messages queue up with no acknowledgment or processing
  • Users experience total silence from the system
  • No automatic detection or recovery
  • Requires manual intervention to restore service

Reproduction

  1. Have control-agent with message queue
  2. Execute worker spawn: pi session spawn --name "dev-agent-..." --skill dev-agent ...
  3. If spawn fails/hangs (e.g., missing API key, config error):
    • Spawn retries multiple times
    • Each retry takes 30-60 seconds
    • Total: 5-10 minutes of blocking
  4. Meanwhile: All incoming messages queue up unprocessed

Proposed Solutions

1. Non-Blocking Worker Spawns (Critical Priority)

Change: Always background worker spawns and verify separately

Before (blocks):

pi session spawn --name "dev-agent-..." --skill dev-agent --model X 2>&1 | tail -5

After (non-blocking):

# Spawn in background
(pi session spawn \
  --name "dev-agent-..." \
  --skill dev-agent \
  --model X \
  > /tmp/spawn-${WORKER_ID}.log 2>&1 &)

# Wait briefly for socket
sleep 3

# Verify spawn succeeded
if [ -S ~/.pi/session-control/${WORKER_NAME}.sock ]; then
  echo "Worker spawned successfully"
  # Send task immediately
  send_to_session --sessionName "${WORKER_NAME}" --message "..."
else
  echo "Worker spawn failed - check logs"
fi
# Continue processing inbox immediately!

Implementation suggestions:

  • Update control-agent skill documentation with non-blocking spawn pattern
  • Add helper function/script for safe worker spawning
  • Warn in dev-agent docs about spawn timing

2. Default Timeouts on Bash Tool (Critical Priority)

Problem: Bash commands can run indefinitely

Solution: Add configurable timeout with reasonable default

interface BashToolOptions {
  command: string;
  timeout?: number; // milliseconds, default: 300000 (5 min)
}

Implementation:

  • Default timeout: 5 minutes
  • Allow override for known long operations (CI polling, builds)
  • Kill process tree on timeout
  • Return timeout error to caller

3. Async Message Processing (High Priority)

Current: Serial message processing - one message blocks all others

Proposed: Concurrent message handling with worker pool

┌──────────────┐
│ Message Queue│
└──────┬───────┘
       │
   ┌───▼────┬────────┬────────┐
   │Worker 1│Worker 2│Worker 3│  ← Process messages concurrently
   └────────┴────────┴────────┘

Benefits:

  • One slow message doesn't block others
  • Natural load balancing
  • Better resource utilization
  • Graceful degradation under load

Considerations:

  • Message ordering (when required)
  • Shared state management
  • Resource limits (max concurrent)

4. Health Monitoring & Auto-Recovery (High Priority)

Add: Watchdog process that monitors control-agent health

Check every 30-60 seconds:

  • Time since last message processed
  • Queue depth
  • Process responsiveness
  • Active child processes

Auto-recovery when stuck:

  1. Log state to incident file
  2. Identify blocking child processes
  3. Kill blocking processes
  4. Verify recovery
  5. Alert operator

Metrics to expose:

  • Messages processed per minute
  • Average processing time
  • Current queue depth
  • Worker count
  • Time since last activity

5. Circuit Breaker for Operations (Medium Priority)

Pattern: Fail fast when operations consistently fail

Operation fails 3 times → 
  Circuit opens (skip attempts for 5 min) → 
  Retry after timeout → 
  Success → Circuit closes

Apply to:

  • Worker spawns
  • API calls
  • External service calls

6. Graceful Degradation (Medium Priority)

Strategies when under load:

  1. Immediate acknowledgment: Always respond within 5 seconds

    • "Got it! Working on this..."
  2. Queue position updates:

  3. Auto-fallback responses:

    • After 2 min: "System under heavy load, will respond shortly"
  4. Priority queue:

    • Some messages (admins, urgent) jump queue

7. Observability Improvements (Medium Priority)

Add status endpoints/tools:

  • /health - Is control-agent responsive?
  • /metrics - Queue depth, processing rate, worker count
  • /status - Current state, last activity timestamp

Structured logging:

{
  "timestamp": "2026-02-27T04:08:01Z",
  "level": "warn",
  "component": "control-agent",
  "event": "spawn_timeout",
  "worker_id": "dev-agent-xyz",
  "duration_ms": 300000
}

8. Worker Lifecycle Tracking (Low Priority)

Track state transitions:

SPAWNING → READY → ASSIGNED → WORKING → REPORTING → COMPLETE/FAILED

Benefits:

  • Identify slow spawns
  • Detect stuck workers
  • Optimize performance
  • Better resource cleanup

Testing Recommendations

Load Testing

  1. Spawn stress: Spawn 10 workers rapidly, verify no blocking
  2. Message flood: 20 messages in 10 seconds, verify all processed
  3. Long operations: Start 5-min operation, verify other work continues

Chaos Testing

  1. Kill random worker processes
  2. Block network for 5 minutes
  3. Fill disk to 99%
  4. Send malformed messages

Recovery Testing

  1. Verify auto-recovery from blocked state
  2. Verify graceful shutdown
  3. Verify worker cleanup on crash

Success Criteria

  • Control-agent never blocks for >30 seconds
  • All incoming messages acknowledged within 5 seconds
  • Messages processed concurrently when independent
  • System auto-recovers from blocked states
  • Health metrics visible and monitored
  • P95 response time < 30 seconds
  • Uptime > 99.9%

Additional Context

This issue was discovered during production operation when a worker spawn hung for 22 minutes, causing complete control-agent unresponsiveness. The worker itself functioned correctly once spawned, but the control-agent remained blocked waiting for the spawn command to return.

Manual intervention (killing blocking processes) immediately restored service, but highlighted the need for:

  1. Asynchronous operations by default
  2. Automatic health monitoring
  3. Self-recovery mechanisms
  4. Better visibility into system state

Related Work

  • Consider similar patterns in other agents (sentry-agent, dev-agent)
  • Review all bash tool usage for potential blocking operations
  • Audit all external process spawning
  • Review timeout usage across codebase

Priority: Critical
Effort: 2-3 weeks (all high-priority items)
Impact: Eliminates entire class of outage scenarios

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions