
fix(proxy): mask single previous response misses#516

Open
Komzpa wants to merge 2 commits into Soju06:main from Komzpa:codex/mask-single-previous-response-miss

Conversation

Contributor

@Komzpa Komzpa commented Apr 29, 2026

Summary

  • mask single-request anonymous HTTP bridge previous_response_not_found as an internal continuity-loss signal instead of leaking the raw upstream 400
  • preserve the existing local previous-response rebind path by raising an internal bridge_previous_response_not_found marker only for the single anonymous HTTP bridge case
  • retry direct WebSocket full-resend follow-ups without a stale previous_response_id when upstream loses a just-completed anchor before response.created
  • sanitize direct WebSocket short continuations that hit upstream previous_response_not_found into stream_incomplete instead of replaying the same stale anchor and exposing the raw upstream 400
  • avoid false local no_plan_support_for_model 503s after partial model-registry refreshes: account selection now only applies the plan filter when the registry has an authoritative entry for the requested model, and model refresh invalidates cached selection inputs
  • mask any lower-layer public Responses or compact previous_response_not_found payload to retryable stream_incomplete so the missing response id cannot leak even if a lower recovery path misses it
  • add OpenSpec change artifacts plus HTTP bridge, WebSocket, and model-registry/load-balancer regression coverage
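The masking described in the first and sixth bullets can be sketched roughly as follows. This is a hypothetical illustration, not the PR's actual code: the function name and error shapes are made up, and the real service layers the same idea across the HTTP bridge, WebSocket, and public/compact paths.

```python
# Hypothetical sketch of fail-closed masking: any upstream
# previous_response_not_found payload becomes a retryable
# stream_incomplete error, so the stale resp_* anchor id and the raw
# upstream 400 never reach the downstream client.

def mask_previous_response_miss(upstream_error: dict) -> dict:
    code = upstream_error.get("code") or upstream_error.get("error", {}).get("code")
    if code == "previous_response_not_found":
        # Drop the raw body entirely; keep only an internal continuity-loss
        # signal that the client is allowed to retry.
        return {
            "status": 502,
            "error": {
                "code": "stream_incomplete",
                "message": "upstream lost response continuity",
            },
        }
    return upstream_error  # unrelated errors pass through unchanged
```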

Root Cause

The HTTP bridge only rewrote upstream previous_response_not_found when the upstream event had a response.id or there were other pending requests. A single pending follow-up with an anonymous upstream error could therefore propagate the raw invalid-request error to Codex clients.
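The fixed predicate can be sketched like this; the function name and parameters are hypothetical stand-ins for the bridge's real state:

```python
# Hypothetical sketch: when is an upstream previous_response_not_found
# maskable instead of being forwarded as a raw 400?

def can_mask_previous_response_miss(pending_count: int, has_response_id: bool) -> bool:
    if has_response_id:
        return True        # error is attributable to a known response
    if pending_count > 1:
        return True        # the old multi-request path already handled this
    # The previously leaking case: a single anonymous pending follow-up
    # must own the error, so it is now masked as a continuity-loss signal.
    return pending_count == 1
```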

Live archive follow-up showed direct WebSocket Codex turns can receive response.completed for an anchor and then hit previous_response_not_found on the next turn seconds later on the same account/session. Full-resend turns can be recovered by reconnecting upstream and replaying the full payload without previous_response_id; short continuation turns cannot, because the only semantic payload is the stale anchor, so they now fail closed as a sanitized continuity-loss signal.
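The recovery split between full-resend and short-continuation turns amounts to a policy like the following sketch (names and turn shape are hypothetical, not the service's actual types):

```python
# Hypothetical recovery policy after upstream loses a just-completed anchor.

def recover_after_anchor_loss(turn: dict) -> dict:
    if turn.get("full_resend"):
        # The full payload is safely replayable: drop the stale anchor and
        # retry on a fresh upstream connection.
        retry = dict(turn)
        retry.pop("previous_response_id", None)
        return {"action": "retry", "payload": retry}
    # A short continuation's only semantic content is the stale anchor;
    # replaying it would hit the same miss, so fail closed with a
    # sanitized continuity-loss error instead.
    return {
        "action": "fail_closed",
        "error": {"status": 502, "code": "stream_incomplete"},
    }
```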

A DNS outage exposed a separate local-routing edge while the live container was running this PR stack: partial model-registry data could make responses/compact apply a plan filter for a model the registry did not actually know yet, producing a local 503 "No accounts with a plan supporting model 'gpt-5.5'" while adjacent responses calls on pro accounts were already succeeding.
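The registry-edge fix boils down to treating the plan filter as applicable only when the registry has an authoritative entry. A minimal sketch, with hypothetical account/registry shapes:

```python
# Hypothetical sketch of plan-aware account selection that tolerates a
# partially refreshed model registry.

def plan_filter_applies(registry: dict, model: str) -> bool:
    # Only enforce the plan filter when the registry authoritatively knows
    # the model; an unknown model must not yield a local 503.
    return model in registry

def select_accounts(accounts: list[dict], registry: dict, model: str) -> list[dict]:
    if not plan_filter_applies(registry, model):
        return list(accounts)  # no authoritative data: skip plan filtering
    supported = registry[model]["plans"]
    return [a for a in accounts if a["plan"] in supported]
```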

Fresh live incident check

Checked the live raw leak with previous_response_id=resp_0bebc32c66e5a5990169f9d272ffbc8191801046ca030a311b from 2026-05-05:

  • 2026-05-05 11:20:19 UTC: upstream created resp_0beb...d272....
  • 2026-05-05 11:20:24 UTC: upstream returned response.completed for that response.
  • 2026-05-05 11:21:27 UTC: the same response stream recorded no close frame received or sent.
  • 2026-05-05 11:22:25 UTC: the next turn sent previous_response_id=resp_0beb...d272... and got raw previous_response_not_found / previous_response_id / 400 downstream.

This was not a cross-account routing miss: the created/completed response, follow-up request, and raw error all used the same account family c26b2ebf.... A new account fe2708b6... appeared nearby in the archive, but for a different parallel request.

So this PR's desired behavior is fail-closed masking: when upstream loses continuity even on the owner account, downstream gets 502 stream_incomplete, not raw previous_response_not_found or a resp_* id.
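One way to check the fail-closed property is to scan downstream error bodies for raw resp_* ids; a hypothetical leak checker along these lines:

```python
import re

# Hypothetical leak detector: a masked downstream error body must not
# contain a raw resp_* response id (hex-suffixed, as in the incident above).
RESP_ID = re.compile(r"resp_[0-9a-f]{8,}")

def leaks_response_id(payload: str) -> bool:
    return bool(RESP_ID.search(payload))
```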

Validation

  • .venv/bin/pytest -q tests/integration/test_proxy_websocket_responses.py (37 passed, 2 pre-existing aiosqlite thread warnings)
  • .venv/bin/pytest -q tests/integration/test_proxy_websocket_responses.py::test_v1_responses_websocket_masks_short_previous_response_not_found_without_retry tests/integration/test_proxy_websocket_responses.py::test_backend_responses_websocket_masks_short_previous_response_not_found_without_retry tests/integration/test_proxy_websocket_responses.py::test_backend_responses_websocket_masks_previous_response_not_found_when_message_omits_response_id tests/integration/test_proxy_websocket_responses.py::test_v1_responses_websocket_retries_full_resend_previous_response_miss_without_anchor (4 passed)
  • .venv/bin/pytest -q tests/unit/test_proxy_load_balancer_refresh.py tests/unit/test_graceful_degradation.py (42 passed, 3 skipped)
  • .venv/bin/ruff check app/modules/proxy/service.py tests/integration/test_proxy_websocket_responses.py
  • .venv/bin/ruff check app/modules/proxy/load_balancer.py app/core/openai/model_refresh_scheduler.py tests/unit/test_proxy_load_balancer_refresh.py
  • .venv/bin/ruff format --check app/modules/proxy/service.py tests/integration/test_proxy_websocket_responses.py
  • .venv/bin/ruff format --check app/modules/proxy/load_balancer.py app/core/openai/model_refresh_scheduler.py tests/unit/test_proxy_load_balancer_refresh.py
  • npx --yes @fission-ai/openspec validate --specs (19 passed)
  • git diff --check
  • /home/kom/proj/codex-lb/.venv/bin/python -m pytest -q tests/unit/test_proxy_api_websocket_auth.py::test_public_previous_response_not_found_error_is_masked_to_stream_incomplete tests/unit/test_proxy_api_websocket_auth.py::test_public_previous_response_invalid_request_param_is_masked_to_stream_incomplete tests/integration/test_proxy_websocket_responses.py tests/integration/test_http_responses_bridge.py::test_v1_responses_http_bridge_rebinds_after_upstream_previous_response_not_found tests/integration/test_http_responses_bridge.py::test_v1_responses_http_bridge_masks_anonymous_previous_response_not_found_with_inflight_request (41 passed, rerun on 2026-05-05)
  • /home/kom/proj/codex-lb/.venv/bin/ruff check app/modules/proxy/service.py tests/unit/test_proxy_api_websocket_auth.py tests/integration/test_proxy_websocket_responses.py tests/integration/test_http_responses_bridge.py (rerun on 2026-05-05)
  • git diff --check (rerun on 2026-05-05)

GitHub CI for this PR head is green: 18/18 checks passed.

Notes

  • make -f src/Makefile precommit is not available in this repository; that Makefile belongs to the surrounding OpenClaw workspace.
  • .venv/bin/ty check app/modules/proxy/service.py tests/integration/test_proxy_websocket_responses.py still reports the pre-existing service.py:9018..9029 diagnostics that are handled by neighboring PR fix(types): clear existing ty diagnostics #517.


@Komzpa Komzpa force-pushed the codex/mask-single-previous-response-miss branch from 9d9ebc2 to 1fe5fe6 Compare April 29, 2026 09:47
@Komzpa Komzpa marked this pull request as ready for review April 29, 2026 09:48
@Komzpa Komzpa force-pushed the codex/mask-single-previous-response-miss branch 4 times, most recently from 14449fe to f37065d Compare April 30, 2026 07:21

mgwals commented May 1, 2026

I validated this branch against a fresh local incident capture from codex-cli 0.128.0 using a local codex-lb:1.15.0-ws-retry container.

Observed in request_logs over a 6h window:

  • 1634 successes, 3 errors
  • all 3 errors were status=error, error_code=stream_incomplete, transport=http
  • message: Upstream websocket closed before response.completed: no close frame received or sent
  • models/efforts: gpt-5.5 with high/xhigh
  • no raw account ids included here; account ids were only hashed in the local capture

This branch matches the mitigation I would want for the continuity-loss class: mask previous_response_not_found, retry only when the request is still safely replayable, and avoid blind retry once a semantic continuation depends only on a stale anchor.

Local validation on this PR branch:

  • uv run pytest -q tests/unit/test_proxy_api_websocket_auth.py::test_public_previous_response_not_found_error_is_masked_to_stream_incomplete tests/unit/test_proxy_api_websocket_auth.py::test_public_previous_response_invalid_request_param_is_masked_to_stream_incomplete tests/integration/test_proxy_websocket_responses.py::test_v1_responses_websocket_masks_short_previous_response_not_found_without_retry tests/integration/test_proxy_websocket_responses.py::test_backend_responses_websocket_masks_short_previous_response_not_found_without_retry tests/integration/test_proxy_websocket_responses.py::test_backend_responses_websocket_masks_previous_response_not_found_when_message_omits_response_id tests/integration/test_proxy_websocket_responses.py::test_v1_responses_websocket_retries_full_resend_previous_response_miss_without_anchor tests/integration/test_http_responses_bridge.py::test_v1_responses_http_bridge_rebinds_after_upstream_previous_response_not_found tests/integration/test_http_responses_bridge.py::test_v1_responses_http_bridge_masks_anonymous_previous_response_not_found_with_inflight_request -> 8 passed
  • uv run openspec validate --specs -> 19 passed, 0 failed
  • uv run ruff check ... on changed proxy/test files -> passed
  • uv run ruff format --check ... on the changed proxy/test files -> wanted to reformat app/modules/proxy/service.py locally, likely a difference between my lockfile/tooling versions and CI's

Separate live smoke against the existing container, without deploying this PR: 60 small streamed /v1/responses requests (20 each for low/high/xhigh) all reached response.completed, and no new error rows appeared after the smoke start.

@Komzpa Komzpa force-pushed the codex/mask-single-previous-response-miss branch 2 times, most recently from 63aa033 to 54fe916 Compare May 3, 2026 13:52
@Komzpa Komzpa force-pushed the codex/mask-single-previous-response-miss branch from 54fe916 to 368e287 Compare May 3, 2026 13:57
Contributor Author

Komzpa commented May 5, 2026

Cross-link note: #535 is a narrower overlapping PR for direct WebSocket previous_response_not_found masking.

This PR (#516) is the broader owned branch and should be treated as the canonical fix for our live incident class: HTTP bridge, direct WebSocket, compact/public masking, and the model-registry edge are all covered here.

Live deployment note: the current local container image is codex-lb:weekly-recovery-safe-25770b8, built from the live composite branch at 25770b8. That stack includes this PR's proxy masking patch as commit 6e4cbc3; its stable patch-id matches this PR head 368e287, so the deployed proxy masking behavior corresponds to #516 even though the live commit hash differs due to the composite stack.
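The patch-id equivalence claim can be reproduced with git itself; a throwaway-repo demo of the property being relied on (amending commit metadata changes the hash but `git patch-id --stable` stays the same, which is how 6e4cbc3 was matched to this PR head 368e287):

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo
echo fix > f.txt
git add f.txt
git commit -q -m "proxy masking patch"
hash1=$(git rev-parse HEAD)
id1=$(git show HEAD | git patch-id --stable | cut -d' ' -f1)
# Rewrite commit metadata only: new commit hash, identical diff.
GIT_COMMITTER_DATE="2030-01-01T00:00:00" git commit -q --amend --no-edit
hash2=$(git rev-parse HEAD)
id2=$(git show HEAD | git patch-id --stable | cut -d' ' -f1)
[ "$hash1" != "$hash2" ] && [ "$id1" = "$id2" ] && echo "patch-ids match"
```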


jumski commented May 8, 2026

Tested - resolves the 503 no_plan_support_for_model false positives with opencode + codex-lb.

To test locally:

git clone https://github.com/Komzpa/codex-lb.git /tmp/codex-lb-pr516
git -C /tmp/codex-lb-pr516 checkout codex/mask-single-previous-response-miss
DOCKER_BUILDKIT=1 docker build -t codex-lb:pr516 /tmp/codex-lb-pr516

Point compose.yaml at codex-lb:pr516 and restart.

Works well. Would love to see this merged.

