
fix(model-refresh): refresh HTTP client on transport errors#547

Open
linusmixson wants to merge 4 commits into Soju06:main from linusmixson:fix-reconnection

Conversation

@linusmixson

Recreating from #502, which appears to have gotten lost.


Summary

This PR makes model refresh recover from transport-level upstream failures by rebuilding the shared aiohttp client and retrying the fetch once.

Handled failures now log compact error details, while unexpected exceptions still produce full stack traces.

Observed Bug Evidence

Running codex-lb via uvx in low-reliability network environments (VPN, laptop sleep/wake, etc.) often left the process in a state where the upstream connection was lost and could not be regained without a restart. These changes make codex-lb behave seamlessly under the same conditions.

Root Cause

The model refresh path uses the shared aiohttp.ClientSession.

When the network changes underneath the process, the model fetch can fail before an HTTP response is received, surfacing as transport exceptions.

Previously, model refresh treated these the same as other model-fetch failures. It did not rebuild the shared HTTP client, so a bad/stale session or connector could continue to be reused after the network came back.

Fix

Model-fetch transport exceptions are now wrapped as ModelFetchError with transport_error=True.
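That wrapping can be sketched as follows. `ModelFetchError` and `transport_error` are names from this PR; `fetch_models` and the stdlib-only exception tuple are illustrative stand-ins (the real code also maps `aiohttp.ClientError`):

```python
import asyncio

class ModelFetchError(Exception):
    """Sketch of the PR's error type: flags failures that happened at the
    transport layer, before any HTTP response was received."""
    def __init__(self, message: str, *, transport_error: bool = False) -> None:
        super().__init__(message)
        self.transport_error = transport_error

# Exceptions raised before an HTTP response exists. The real code also maps
# aiohttp.ClientError; stdlib exceptions keep this sketch dependency-free.
TRANSPORT_EXCEPTIONS = (OSError, asyncio.TimeoutError)

async def fetch_models(fetch):
    """Run a model fetch, reclassifying transport failures."""
    try:
        return await fetch()
    except TRANSPORT_EXCEPTIONS as exc:
        # transport_error=True lets the scheduler decide to rotate the
        # shared HTTP client and retry once.
        raise ModelFetchError(
            f"model fetch failed: {exc!r}", transport_error=True
        ) from exc
```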

When the scheduler sees the first transport error in a refresh cycle, it refreshes the shared HTTP client, closes the previous client, and retries the model fetch once.

HTTP status errors, auth failures, and invalid upstream responses stay on the existing paths.

The recovery logs now include a terse summary of the original transport error, so failures are easier to diagnose without turning expected retry paths into stack traces.
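A minimal sketch of that one-shot recovery. `refresh_models_once` and the injected callables are hypothetical names, but the branch structure (non-transport errors re-raised on the existing paths, one client rotation followed by one retry) mirrors the behavior described above:

```python
import asyncio

class ModelFetchError(Exception):
    """Stand-in for the PR's error type (see above)."""
    def __init__(self, message, *, transport_error=False):
        super().__init__(message)
        self.transport_error = transport_error

async def refresh_models_once(fetch, refresh_http_client):
    """One-shot recovery: on the first transport error, rotate the shared
    client and retry the fetch exactly once."""
    try:
        return await fetch()
    except ModelFetchError as exc:
        if not exc.transport_error:
            raise  # HTTP status / auth / bad payload: existing paths
        # Terse log of the original error; no stack trace for expected retries.
        print(f"transport error, rotating HTTP client: {exc}")
        await refresh_http_client()
        return await fetch()  # a second failure propagates normally
```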


  • Changes after review. Model's comments:

Changes address the PR review’s main blocker by replacing immediate shared HTTP client teardown with lease-aware client rotation.

The P1 race is handled in app/core/clients/http.py:25: the global client is now wrapped in _ManagedHttpClient, tracks active_leases, and only closes a retired client after all active leases release. refresh_http_client() now swaps in the replacement under the lock and requests close on the previous managed client instead of synchronously tearing it down while proxy/model/usage requests may still hold it (app/core/clients/http.py:197). Shutdown still force-closes active and retired clients so process exit does not hang on long-lived streams (app/core/clients/http.py:209).

Call sites that use the shared session/retry client were moved onto lease helpers, so in-flight work pins the client it started with. This includes model fetch, token refresh, OAuth flows, proxy streaming/compact/transcription, and usage fetch. The compatibility get_http_client() remains, but now documents that network use should lease the client (app/core/clients/http.py:227).

The model refresh recovery path was also broadened to cover token-refresh transport failures, not just model-fetch transport failures. RefreshError now carries transport_error, token refresh maps aiohttp.ClientError, timeout, and OSError into non-permanent transport errors (app/core/auth/refresh.py:40), and the scheduler performs one shared-client rotation/retry per failover loop for either refresh or fetch transport errors (app/core/openai/model_refresh_scheduler.py:145).
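The token-refresh side can be sketched the same way. `RefreshError` and `transport_error` are from the PR; `refresh_token` is a hypothetical wrapper, and the stdlib exceptions stand in for the `aiohttp.ClientError` mapping:

```python
import asyncio

class RefreshError(Exception):
    """Sketch: mirrors ModelFetchError for token refresh. Transport failures
    are non-permanent, so the account avoids a long cooldown."""
    def __init__(self, message, *, transport_error=False, permanent=False):
        super().__init__(message)
        self.transport_error = transport_error
        self.permanent = permanent

async def refresh_token(do_refresh):
    """Run a token refresh, mapping transport failures to RefreshError."""
    try:
        return await do_refresh()
    except (OSError, asyncio.TimeoutError) as exc:
        # The real code also maps aiohttp.ClientError here.
        raise RefreshError(
            repr(exc), transport_error=True, permanent=False
        ) from exc
```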

The CI type-check comments from the review are fixed in tests/unit/test_model_refresh_scheduler.py:20: _account() now returns a real Account instead of a SimpleNamespace, and the unused suppression is gone. I ran uv run ty check; it passes.

Tests were expanded around the new behavior: HTTP client lease/drain semantics, forced shutdown, model-fetch transport retry, token-refresh transport retry, one-shot recovery across failover, and avoiding cached cooldown reuse for temporary transport refresh failures.


Test Plan

  • uv run ruff check app/core/openai/model_refresh_scheduler.py tests/unit/test_model_refresh_scheduler.py
  • uv run pytest tests/unit/test_model_refresh_scheduler.py
  • uv run pytest tests/unit/test_http_client.py
  • git diff --check

Retry model fetches after transport failures by refreshing the shared HTTP client.
  - Add transport-error classification to model fetch failures.
  - Refresh the shared HTTP client once on transport errors, then retry the fetch with the same account.
  - Keep the retry path from cascading into repeated reconnect attempts.
  - Add unit coverage for transport error detection, HTTP client refresh replacement, and failover retry behavior.

Agent: GPT-5.4; GPT-5.5
Agent: GPT-5.4-mini; GPT-5.5
