Skip to content

fix: reset circuit breaker failures after recovery#100

Open
shikhar wants to merge 1 commit intomainfrom
codex/fix-issue-97-circuit-breaker-recovery
Open

fix: reset circuit breaker failures after recovery#100
shikhar wants to merge 1 commit intomainfrom
codex/fix-issue-97-circuit-breaker-recovery

Conversation

@shikhar
Copy link
Copy Markdown
Member

@shikhar shikhar commented Mar 27, 2026

Summary

  • reset stale consecutive failures only when an open circuit breaker has fully recovered
  • add regression coverage to ensure one failure after recovery does not immediately reopen the circuit

Testing

  • cargo +nightly fmt
  • cargo nextest run
  • cargo clippy --all-features --all-targets -- -D warnings --allow deprecated

Closes #97

@greptile-apps
Copy link
Copy Markdown

greptile-apps bot commented Mar 27, 2026

Greptile Summary

This PR fixes a circuit breaker bug (#97) where consecutive_failures was never cleared after the recovery window elapsed, causing a single new failure to immediately reopen the circuit. The fix mutates consecutive_failures = 0 inside is_circuit_open when it detects that RECOVERY_TIME has passed, so the counter only starts accumulating fresh failures from that point forward. A regression test validates that CONSECUTIVE_FAILURE_THRESHOLD new failures are required to trip the circuit again after recovery.\n\nKey changes:\n- is_circuit_open promoted from &self to &mut self, and now resets consecutive_failures to 0 on the recovered path.\n- metrics() already reads consecutive_failures after calling is_circuit_open, so exported metrics and the BucketMetrics.consecutive_failures field automatically reflect the reset value — no additional ordering change needed.\n- New tokio::test exercises the full trip → wait → recover → single failure → full re-trip cycle with tokio::time::pause for deterministic time control.

Confidence Score: 5/5

Safe to merge — the fix is minimal, logically sound, and covered by a deterministic regression test.

The change correctly addresses the described bug with no new issues introduced. The only nuance worth knowing (not a blocker) is that the reset is lazy: it fires the first time is_circuit_open is evaluated after recovery, not eagerly when the window expires. In the normal request flow (attempt_orderscoreobserve) this is fine because score runs before any new observe, so the reset always precedes the next failure count. No P0 or P1 findings.

No files require special attention.

Important Files Changed

Filename Overview
src/object_store/stats.rs Correct targeted fix: is_circuit_open now resets consecutive_failures to 0 on the recovered path, with a well-structured regression test validating the full recovery lifecycle.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["is_circuit_open(now)"] --> B{consecutive_failures < THRESHOLD?}
    B -- yes --> C[return false - circuit closed]
    B -- no --> D{duration_since last_failure < RECOVERY_TIME?}
    D -- yes --> E[return true - circuit open]
    D -- no --> F[consecutive_failures = 0 - reset stale count]
    F --> G[return false - circuit recovered]
Loading

Reviews (1): Last reviewed commit: "fix: reset circuit breaker failures afte..." | Re-trigger Greptile

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Detail Bug] Object store: circuit breaker reopens after recovery on a single failure due to non-reset consecutive failure counter

1 participant