fix(proxy): mask single previous response misses #516
Komzpa wants to merge 2 commits into Soju06:main
Conversation
I validated this branch against a fresh local incident capture from Observed in
This branch matches the mitigation I would want for the continuity-loss class: mask Local validation on this PR branch:
Separate live smoke against the existing container, without deploying this PR: 60 small streamed
Cross-link note: #535 is a narrower overlapping PR for direct WebSocket
This PR (#516) is the broader owned branch and should be treated as the canonical fix for our live incident class: HTTP bridge, direct WebSocket, compact/public masking, and the model-registry edge are all covered here.
Live deployment note: the current local container image is
Tested - resolves the
To test locally:
git clone https://github.com/Komzpa/codex-lb.git /tmp/codex-lb-pr516
git -C /tmp/codex-lb-pr516 checkout codex/mask-single-previous-response-miss
DOCKER_BUILDKIT=1 docker build -t codex-lb:pr516 /tmp/codex-lb-pr516
Point compose.yaml at
Works well. Would love to see this merged.
Summary
previous_response_not_found as an internal continuity-loss signal instead of leaking the raw upstream 400
bridge_previous_response_not_found marker only for the single anonymous HTTP bridge case
previous_response_id when upstream loses a just-completed anchor before response.created
previous_response_not_found into stream_incomplete instead of replaying the same stale anchor and exposing the raw upstream 400
no_plan_support_for_model 503s after partial model-registry refreshes: account selection now only applies the plan filter when the registry has an authoritative entry for the requested model, and model refresh invalidates cached selection inputs
previous_response_not_found payload to retryable stream_incomplete so the missing response id cannot leak even if a lower recovery path misses it
Root Cause
The HTTP bridge only rewrote upstream previous_response_not_found when the upstream event had a response.id or there were other pending requests. A single pending follow-up with an anonymous upstream error could therefore propagate the raw invalid-request error to Codex clients.
Live archive follow-up showed direct WebSocket Codex turns can receive response.completed for an anchor and then hit previous_response_not_found on the next turn seconds later on the same account/session. Full-resend turns can be recovered by reconnecting upstream and replaying the full payload without previous_response_id; short continuation turns cannot, because the only semantic payload is the stale anchor, so they now fail closed as a sanitized continuity-loss signal.
A DNS outage exposed a separate local-routing edge while the live container was running this PR stack: partial model registry data could make responses/compact apply a plan filter for a model the registry did not actually know yet, producing local 503 No accounts with a plan supporting model 'gpt-5.5' while adjacent responses calls on pro accounts were already succeeding.
Fresh live incident check
Checked the live raw leak with previous_response_id=resp_0bebc32c66e5a5990169f9d272ffbc8191801046ca030a311b from 2026-05-05:
2026-05-05 11:20:19 UTC: upstream created resp_0beb...d272....
2026-05-05 11:20:24 UTC: upstream returned response.completed for that response.
2026-05-05 11:21:27 UTC: the same response stream recorded no close frame received or sent.
2026-05-05 11:22:25 UTC: the next turn sent previous_response_id=resp_0beb...d272... and got raw previous_response_not_found / previous_response_id / 400 downstream.
This was not a cross-account routing miss: the created/completed response, follow-up request, and raw error all used the same account family c26b2ebf.... A new account fe2708b6... appeared nearby in the archive, but for a different parallel request.
So this PR's desired behavior is fail-closed masking: when upstream loses continuity even on the owner account, downstream gets 502 stream_incomplete, not raw previous_response_not_found or a resp_* id.
Validation
.venv/bin/pytest -q tests/integration/test_proxy_websocket_responses.py (37 passed, 2 pre-existing aiosqlite thread warnings)
.venv/bin/pytest -q tests/integration/test_proxy_websocket_responses.py::test_v1_responses_websocket_masks_short_previous_response_not_found_without_retry tests/integration/test_proxy_websocket_responses.py::test_backend_responses_websocket_masks_short_previous_response_not_found_without_retry tests/integration/test_proxy_websocket_responses.py::test_backend_responses_websocket_masks_previous_response_not_found_when_message_omits_response_id tests/integration/test_proxy_websocket_responses.py::test_v1_responses_websocket_retries_full_resend_previous_response_miss_without_anchor (4 passed)
.venv/bin/pytest -q tests/unit/test_proxy_load_balancer_refresh.py tests/unit/test_graceful_degradation.py (42 passed, 3 skipped)
.venv/bin/ruff check app/modules/proxy/service.py tests/integration/test_proxy_websocket_responses.py
.venv/bin/ruff check app/modules/proxy/load_balancer.py app/core/openai/model_refresh_scheduler.py tests/unit/test_proxy_load_balancer_refresh.py
.venv/bin/ruff format --check app/modules/proxy/service.py tests/integration/test_proxy_websocket_responses.py
.venv/bin/ruff format --check app/modules/proxy/load_balancer.py app/core/openai/model_refresh_scheduler.py tests/unit/test_proxy_load_balancer_refresh.py
npx --yes @fission-ai/openspec validate --specs (19 passed)
git diff --check
/home/kom/proj/codex-lb/.venv/bin/python -m pytest -q tests/unit/test_proxy_api_websocket_auth.py::test_public_previous_response_not_found_error_is_masked_to_stream_incomplete tests/unit/test_proxy_api_websocket_auth.py::test_public_previous_response_invalid_request_param_is_masked_to_stream_incomplete tests/integration/test_proxy_websocket_responses.py tests/integration/test_http_responses_bridge.py::test_v1_responses_http_bridge_rebinds_after_upstream_previous_response_not_found tests/integration/test_http_responses_bridge.py::test_v1_responses_http_bridge_masks_anonymous_previous_response_not_found_with_inflight_request (41 passed, rerun on 2026-05-05)
/home/kom/proj/codex-lb/.venv/bin/ruff check app/modules/proxy/service.py tests/unit/test_proxy_api_websocket_auth.py tests/integration/test_proxy_websocket_responses.py tests/integration/test_http_responses_bridge.py (rerun on 2026-05-05)
git diff --check (rerun on 2026-05-05)
GitHub CI for this PR head is green: 18/18 checks passed.
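For reviewers who want a compact mental model of what the tests above exercise, here is a minimal Python sketch of the three decision rules this PR describes: detect a previous-response miss, either replay a full-resend turn without the anchor or fail closed with a sanitized `stream_incomplete`, and skip the plan filter when the registry has no entry for the model. This is not the code in `app/modules/proxy/service.py` or `load_balancer.py`. The names `Recovery`, `is_previous_response_miss`, `recover_from_miss`, `eligible_accounts`, the `is_full_resend` flag, and the payload and registry shapes are all hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any

MISS_CODE = "previous_response_not_found"


@dataclass(frozen=True)
class Recovery:
    """Hypothetical result type: what the proxy should do after a miss."""
    action: str  # "retry_without_anchor" or "fail_closed"
    payload: dict[str, Any]


def is_previous_response_miss(error: dict[str, Any]) -> bool:
    """Assumed detection rule: match the error code, or an invalid-request
    error that points at the previous_response_id parameter."""
    if error.get("code") == MISS_CODE:
        return True
    return (
        error.get("type") == "invalid_request_error"
        and error.get("param") == "previous_response_id"
    )


def recover_from_miss(request: dict[str, Any], is_full_resend: bool) -> Recovery:
    """Full-resend turns are replayed without the stale anchor; short
    continuation turns fail closed with a sanitized error."""
    if is_full_resend:
        replay = {k: v for k, v in request.items() if k != "previous_response_id"}
        return Recovery("retry_without_anchor", replay)
    # Assumed payload shape. It carries no resp_* id and no upstream code.
    masked = {
        "status": 502,
        "error": {"code": "stream_incomplete", "message": "Upstream stream was incomplete"},
    }
    return Recovery("fail_closed", masked)


def eligible_accounts(
    accounts: list[dict[str, Any]],
    model: str,
    registry: dict[str, set[str]],
) -> list[dict[str, Any]]:
    """Apply the plan filter only when the registry has an entry for the model.
    The registry shape (model name mapped to a set of plans) is an assumption."""
    plans = registry.get(model)
    if plans is None:
        return list(accounts)  # no authoritative entry: do not filter
    return [a for a in accounts if a.get("plan") in plans]
```

Taking `is_full_resend` as an explicit argument keeps the sketch honest about what it does not know: how the proxy classifies a turn as full-resend is not shown in this PR text, so the caller has to supply it.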
Notes
make -f src/Makefile precommit is not available in this repository; that Makefile belongs to the surrounding OpenClaw workspace.
.venv/bin/ty check app/modules/proxy/service.py tests/integration/test_proxy_websocket_responses.py still reports the pre-existing service.py:9018..9029 diagnostics that are handled by neighboring PR fix(types): clear existing ty diagnostics #517.
Related issues