fix(model-refresh): refresh HTTP client on transport errors #547
Open
linusmixson wants to merge 4 commits into Soju06:main
Retry model fetches after transport failures by refreshing the shared HTTP client.
- Add transport-error classification to model fetch failures.
- Refresh the shared HTTP client once on transport errors, then retry the fetch with the same account.
- Keep the retry path from cascading into repeated reconnect attempts.
- Add unit coverage for transport error detection, HTTP client refresh replacement, and failover retry behavior.
Agent: GPT-5.4; GPT-5.5
Agent: GPT-5.4-mini; GPT-5.5
Recreating from #502, which appears to have gotten lost.
Summary
This PR makes model refresh recover from transport-level upstream failures by rebuilding the shared aiohttp client and retrying the fetch once.
Handled failures now include compact error details, while unexpected exceptions still keep stack traces.
Observed Bug Evidence
Running codex-lb via uvx in low-reliability network environments (VPN, laptop sleep/wake, etc.) often left the process in a state where the upstream connection was lost and could not be regained without a restart. With these changes, codex-lb behaves seamlessly under the same conditions.
Root Cause
The model refresh path uses the shared aiohttp.ClientSession.
When the network changes underneath the process, the model fetch can fail before an HTTP response is received, surfacing as transport exceptions.
Previously, model refresh treated these the same as other model-fetch failures. It did not rebuild the shared HTTP client, so a bad/stale session or connector could continue to be reused after the network came back.
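To make the stale-session problem concrete, here is a minimal sketch of a shared-client holder that can be rebuilt. It is not the code from this PR: `SharedClient` and `FakeSession` are hypothetical names, and `FakeSession` only mimics the `closed` flag and awaitable `close()` that `aiohttp.ClientSession` provides, so the example runs without a network.

```python
import asyncio
from typing import Callable, Optional


class FakeSession:
    """Stand-in for aiohttp.ClientSession (assumption: only close/closed matter here)."""

    def __init__(self) -> None:
        self.closed = False

    async def close(self) -> None:
        self.closed = True


class SharedClient:
    """Hypothetical holder for the process-wide session."""

    def __init__(self, factory: Callable[[], FakeSession]) -> None:
        self._factory = factory
        self._session: Optional[FakeSession] = None

    def get(self) -> FakeSession:
        # Without a refresh, the same (possibly stale) session is handed out forever.
        if self._session is None:
            self._session = self._factory()
        return self._session

    async def refresh(self) -> FakeSession:
        # Swap first, with no await in between, so callers of get() never
        # receive a session that is about to be closed.
        old, self._session = self._session, self._factory()
        if old is not None:
            await old.close()
        return self._session


async def _demo() -> None:
    shared = SharedClient(FakeSession)
    stale = shared.get()
    fresh = await shared.refresh()
    print(stale.closed, fresh.closed, shared.get() is fresh)


if __name__ == "__main__":
    asyncio.run(_demo())
```

Swapping before closing is a design choice of this sketch; requests already in flight on the old session would still need handling in a real service.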
Fix
Model-fetch transport exceptions are now wrapped as ModelFetchError with transport_error=True.
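The classification step might look roughly like the sketch below. Only the names `ModelFetchError` and `transport_error` come from this description; the constructor shape, the `wrap_fetch_failure` helper, and the exception tuple are assumptions. The tuple uses stdlib types so the example runs on its own; with aiohttp, `aiohttp.ClientConnectionError` would be a natural addition.

```python
import asyncio


class ModelFetchError(Exception):
    """Stand-in for the PR's error type; the constructor shape is assumed."""

    def __init__(self, message: str, *, transport_error: bool = False) -> None:
        super().__init__(message)
        self.transport_error = transport_error


# Assumption: exceptions raised before any HTTP response has been received.
TRANSPORT_EXCEPTIONS = (OSError, asyncio.TimeoutError)


def wrap_fetch_failure(exc: Exception) -> ModelFetchError:
    """Hypothetical helper: wrap a raw failure and flag transport errors."""
    is_transport = isinstance(exc, TRANSPORT_EXCEPTIONS)
    # Keep the original exception type and message so logs stay diagnosable.
    return ModelFetchError(
        f"{type(exc).__name__}: {exc}", transport_error=is_transport
    )
```

For example, `wrap_fetch_failure(ConnectionResetError("peer reset"))` is flagged as a transport error, while `wrap_fetch_failure(ValueError("bad json"))` is not.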
When the scheduler sees the first transport error in a refresh cycle, it refreshes the shared HTTP client, closes the previous client, and retries the model fetch once.
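The refresh-once-then-retry behavior can be sketched as follows. This is an illustration under assumptions, not the scheduler code: `RefreshCycle`, `refresh_client`, and `fetch_models` are hypothetical names, and `fetch_models` is assumed to be bound to the same account on both attempts.

```python
import asyncio


class ModelFetchError(Exception):
    """Stand-in for the PR's error type; the constructor shape is assumed."""

    def __init__(self, message: str, *, transport_error: bool = False) -> None:
        super().__init__(message)
        self.transport_error = transport_error


class RefreshCycle:
    """Hypothetical per-cycle state: allows at most one client refresh."""

    def __init__(self, refresh_client) -> None:
        self._refresh_client = refresh_client  # async callable that swaps the client
        self.client_refreshed = False

    async def fetch(self, fetch_models):
        try:
            return await fetch_models()
        except ModelFetchError as exc:
            # Non-transport failures, and any transport failure after the one
            # allowed refresh, go back to the caller unchanged.
            if not exc.transport_error or self.client_refreshed:
                raise
            self.client_refreshed = True
            await self._refresh_client()
            # Single retry; a second failure propagates without another refresh.
            return await fetch_models()
```

Tracking `client_refreshed` on the cycle rather than per call is what keeps a persistently broken network from triggering a reconnect on every fetch in the same cycle.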
HTTP status errors, auth failures, and invalid upstream responses stay on the existing paths.
The recovery logs now include the original transport error tersely, so failures are easier to diagnose without turning expected retry paths into stack traces.
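A minimal sketch of that logging split, using only the standard `logging` module. The logger name, message wording, and `log_fetch_failure` helper are assumptions; the point is that handled failures get a one-line message while anything unexpected is logged with `exc_info` so the traceback is kept.

```python
import logging

logger = logging.getLogger("model_refresh_sketch")  # hypothetical logger name


class ModelFetchError(Exception):
    """Stand-in for the PR's error type; the constructor shape is assumed."""

    def __init__(self, message: str, *, transport_error: bool = False) -> None:
        super().__init__(message)
        self.transport_error = transport_error


def log_fetch_failure(exc: Exception) -> None:
    """Hypothetical helper: terse for handled failures, full trace otherwise."""
    if isinstance(exc, ModelFetchError):
        # Expected retry path: one compact line, no traceback.
        logger.warning(
            "model fetch failed (transport_error=%s): %s", exc.transport_error, exc
        )
    else:
        # Unexpected: keep the stack trace for debugging.
        logger.error("unexpected error during model refresh", exc_info=exc)
```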
Test Plan
uv run ruff check app/core/openai/model_refresh_scheduler.py tests/unit/test_model_refresh_scheduler.py
uv run pytest tests/unit/test_model_refresh_scheduler.py
uv run pytest tests/unit/test_http_client.py
uv run pytest tests/unit/test_model_refresh_scheduler.py
git diff --check