fix(x/audit): bootstrap-recovery exception when active set is empty#135
Merged
Conversation
When the epoch's anchored active set is empty (all supernodes POSTPONED), the peer-port recovery rule in shouldRecoverAtEpochEnd becomes unsatisfiable by construction: with zero probers, no peer report exists that could attest all-ports-OPEN for any POSTPONED supernode. The chain cannot self-heal — every validator key holder must perform a manual deregister+re-register cycle out-of-band. Trigger on mainnet (live params at height 5,001,129): - consecutive_epochs_to_postpone = 1 (one missed epoch postpones) - 15 ACTIVE / 11 POSTPONED / 2 DISABLED — thin active margin - Upgrade halts >= 1 epoch (~40 min) → SNs that lag postpone in lockstep → active set can drop to 0 → permanent deadlock. Fix: when GetEpochAnchor(epochID).ActiveSupernodeAccounts is empty, accept a compliant self host-report alone as sufficient for recovery. The bootstrap exception sits AFTER the storage-truth and action-finalization redirects (they keep their own recovery semantics) and AFTER selfHostCompliant (a misbehaving SN cannot self-recover via this branch). When no anchor exists for the epoch (test fixture or pre-anchor edge case), the branch is skipped and the legacy peer-port path runs unchanged. Test matrix (5 cells): - empty anchor + compliant self-report → recover - empty anchor + no self-report → no-recover (self-gate) - empty anchor + non-compliant self-report → no-recover (self-gate) - non-empty anchor + no peer obs → no-recover (legacy preserved) - no anchor → no-recover (legacy preserved) The pre-fix scenario test that asserted deadlock (TestEnforceEpochEnd_EmptyActiveSet_PostponedCannotRecover) is inverted to its new contract: recovery succeeds via the bootstrap exception when self-reports are compliant. Risk: LOW. Reads existing deterministic state (EpochAnchor) only. Branch is only reachable when no other recovery path applies and self-compliance has already been verified. No new external calls, no wall-clock dependency, no map iteration. Cosmos determinism preserved. Rollback: revert this commit. The legacy peer-port deadlock returns, recoverable via the documented deregister+re-register procedure (skill: lumera-supernode-postponed-recovery). Refs: 2026-05-08 devnet incident where all 5 SNs went POSTPONED; gov proposal 33 to bypass via empty required_open_ports passed on-chain but silently no-op'd because Params.WithDefaults() re-fills the list. The deadlock is a real protocol-level design gap, not a devnet quirk.
Two system tests in audit_empty_active_set_bootstrap_test.go were
written to document the empty-active-set DEADLOCK as expected
behavior (one used legacy MsgReportSupernodeMetrics to break it, the
other asserted 3 consecutive host-only-report epochs never recover).
With the bootstrap-recovery exception in shouldRecoverAtEpochEnd
(this PR's main change), the deadlock no longer exists: compliant
self host-reports alone are sufficient to recover when the active
set is empty.
Invert both tests to the new contract:
1. TestAuditEmptyActiveSetBootstrap_HostOnlyReportsRecover
(was: TestAuditEmptyActiveSetBootstrap_LegacyMetricsBreaksDeadlock
+ TestAuditEmptyActiveSetDeadlock_HostOnlyReportsCannotRecover)
Asserts both POSTPONED SNs recover to ACTIVE at epoch 1 end after
submitting compliant host-only reports — no legacy metrics path
needed. The chain self-heals.
2. TestAuditEmptyActiveSetBootstrap_NonCompliantHostStaysPostponed (NEW)
Guards the self-compliance gate. With MinDiskFreePercent=20, a
POSTPONED SN that reports DiskUsagePercent=95 (5% free) MUST
remain POSTPONED even though the active set is empty. This blocks
the exception from becoming a 'free pass' for misbehaving SNs and
complements the unit-level violation tests in
x/audit/v1/keeper/enforcement_empty_active_set_test.go.
Helpers added in audit_test_helpers_test.go:
- auditHostReportWithDiskUsageJSON: lets a test pin DiskUsagePercent.
- setAuditParamsForFastEpochsWithMinDiskFree: lets a test override
MinDiskFreePercent in genesis.
Found by: PR #135 CI system-test failure (the previous tests asserted
the pre-fix deadlock contract). The original assertions are now
covered by historical context in commit messages and the skill
lumera-supernode-postponed-recovery.
j-rafique
approved these changes
May 11, 2026
mateeullahmalik
added a commit
that referenced
this pull request
May 11, 2026
Master added a full-scan golangci-lint hard gate in lint.yml (PR #135 era). That supersedes this PR's original premise of being the *only* lint gate. Reworked into an additive PR-only layer: - Drop the lint.yml/.golangci.yml overrides; master's full-scan stays the authoritative correctness gate. - Add .github/workflows/lint-pr.yml carrying: * reviewdog-wrapped diff-only golangci-lint (filter_mode: added) so findings inside the PR diff land as inline review comments rather than a single check-run log scrape. * govulncheck job as a separate security gate. Copilot review applied to the new file: - No `level: warning` next to `fail_level: error` — the two together silently downgrade every finding and bypass the gate. Drop `level` so each diagnostic's native severity is preserved. - Split reviewdog step on `head.repo.full_name == github.repository`: same-repo PRs use `github-pr-review` (inline comments), fork PRs use `reporter: local` (log-only) because fork PR tokens are strictly read-only and neither pr-review nor pr-check can post results. - govulncheck-action: keep `repo-checkout: false` from the previous commit (duplicate-Authorization-header fix).
mateeullahmalik
added a commit
that referenced
this pull request
May 11, 2026
…135) (#137) * fix(x/audit): bootstrap-recovery exception when active set is empty When the epoch's anchored active set is empty (all supernodes POSTPONED), the peer-port recovery rule in shouldRecoverAtEpochEnd becomes unsatisfiable by construction: with zero probers, no peer report exists that could attest all-ports-OPEN for any POSTPONED supernode. The chain cannot self-heal — every validator key holder must perform a manual deregister+re-register cycle out-of-band. Trigger on mainnet (live params at height 5,001,129): - consecutive_epochs_to_postpone = 1 (one missed epoch postpones) - 15 ACTIVE / 11 POSTPONED / 2 DISABLED — thin active margin - Upgrade halts >= 1 epoch (~40 min) → SNs that lag postpone in lockstep → active set can drop to 0 → permanent deadlock. Fix: when GetEpochAnchor(epochID).ActiveSupernodeAccounts is empty, accept a compliant self host-report alone as sufficient for recovery. The bootstrap exception sits AFTER the storage-truth and action-finalization redirects (they keep their own recovery semantics) and AFTER selfHostCompliant (a misbehaving SN cannot self-recover via this branch). When no anchor exists for the epoch (test fixture or pre-anchor edge case), the branch is skipped and the legacy peer-port path runs unchanged. Test matrix (5 cells): - empty anchor + compliant self-report → recover - empty anchor + no self-report → no-recover (self-gate) - empty anchor + non-compliant self-report → no-recover (self-gate) - non-empty anchor + no peer obs → no-recover (legacy preserved) - no anchor → no-recover (legacy preserved) The pre-fix scenario test that asserted deadlock (TestEnforceEpochEnd_EmptyActiveSet_PostponedCannotRecover) is inverted to its new contract: recovery succeeds via the bootstrap exception when self-reports are compliant. Risk: LOW. Reads existing deterministic state (EpochAnchor) only. Branch is only reachable when no other recovery path applies and self-compliance has already been verified. No new external calls, no wall-clock dependency, no map iteration. Cosmos determinism preserved. Rollback: revert this commit. The legacy peer-port deadlock returns, recoverable via the documented deregister+re-register procedure (skill: lumera-supernode-postponed-recovery). Refs: 2026-05-08 devnet incident where all 5 SNs went POSTPONED; gov proposal 33 to bypass via empty required_open_ports passed on-chain but silently no-op'd because Params.WithDefaults() re-fills the list. The deadlock is a real protocol-level design gap, not a devnet quirk. * test(systemtests): invert empty-active-set tests for bootstrap exception Two system tests in audit_empty_active_set_bootstrap_test.go were written to document the empty-active-set DEADLOCK as expected behavior (one used legacy MsgReportSupernodeMetrics to break it, the other asserted 3 consecutive host-only-report epochs never recover). With the bootstrap-recovery exception in shouldRecoverAtEpochEnd (this PR's main change), the deadlock no longer exists: compliant self host-reports alone are sufficient to recover when the active set is empty. Invert both tests to the new contract: 1. TestAuditEmptyActiveSetBootstrap_HostOnlyReportsRecover (was: TestAuditEmptyActiveSetBootstrap_LegacyMetricsBreaksDeadlock + TestAuditEmptyActiveSetDeadlock_HostOnlyReportsCannotRecover) Asserts both POSTPONED SNs recover to ACTIVE at epoch 1 end after submitting compliant host-only reports — no legacy metrics path needed. The chain self-heals. 2. TestAuditEmptyActiveSetBootstrap_NonCompliantHostStaysPostponed (NEW) Guards the self-compliance gate. With MinDiskFreePercent=20, a POSTPONED SN that reports DiskUsagePercent=95 (5% free) MUST remain POSTPONED even though the active set is empty. This blocks the exception from becoming a 'free pass' for misbehaving SNs and complements the unit-level violation tests in x/audit/v1/keeper/enforcement_empty_active_set_test.go. Helpers added in audit_test_helpers_test.go: - auditHostReportWithDiskUsageJSON: lets a test pin DiskUsagePercent. - setAuditParamsForFastEpochsWithMinDiskFree: lets a test override MinDiskFreePercent in genesis. Found by: PR #135 CI system-test failure (the previous tests asserted the pre-fix deadlock contract). The original assertions are now covered by historical context in commit messages and the skill lumera-supernode-postponed-recovery. (cherry picked from commit d0e181d)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes the empty active set deadlock in the audit module: when all supernodes are simultaneously POSTPONED, the existing peer-port recovery rule in
shouldRecoverAtEpochEndbecomes unsatisfiable by construction (zero probers ⇒ zero peer reports ⇒ no SN can recover). The chain cannot self-heal and recovery requires every validator key holder to perform a manualderegister + re-registercycle out-of-band — a distributed coordination problem on mainnet.This PR adds a bootstrap-recovery exception: when the epoch's anchored active set is empty AND the POSTPONED SN submits a compliant self host-report, the SN recovers to ACTIVE.
Why this matters (live mainnet evidence)
Queried
lumera.audit.v1.Query/Paramsat mainnet height 5,001,129:Current state: 15 ACTIVE / 11 POSTPONED / 2 DISABLED — thin active margin.
Realistic upgrade-day trigger: an upgrade-halt window plus rolling restart lag of >40 min for a subset of SNs ⇒ those SNs miss one epoch ⇒ POSTPONED in lockstep. If the surviving active set drops to zero, the deadlock locks in permanently.
Two governance escape hatches that LOOK viable but DON'T WORK were verified on devnet 2026-05-08 (gov proposal 33 PASSED on-chain but silently no-op'd):
MsgUpdateParamswithrequired_open_ports=[]is silently re-filled to defaults byParams.WithDefaults()(x/audit/v1/types/params.go:406-407)Params.Validate()line 619 rejects empty listThe deadlock is a real protocol-level design gap, not a devnet quirk.
What changed
x/audit/v1/keeper/enforcement.go::shouldRecoverAtEpochEnd— adds an 11-line branch that fires whenGetEpochAnchor(epochID).ActiveSupernodeAccountsis empty, after the existing storage-truth / action-finalization redirects and afterselfHostCompliant. A misbehaving SN cannot self-recover via this branch — self-compliance still gates it.Invariants preserved
EpochAnchor) frozen at epoch startshouldRecoverFromStorageTruthPostponementruns first, bootstrap branch NOT takenshouldRecoverFromActionFinalizationPostponementruns first, bootstrap branch NOT takenTest matrix (5 cells, all PASS)
x/audit/v1/keeper/enforcement_empty_active_set_test.go:TestEnforceEpochEnd_EmptyActiveSet_PostponedRecoversViaBootstrapExceptionTestEnforceEpochEnd_EmptyActiveSet_NoSelfReport_NoRecoverTestEnforceEpochEnd_EmptyActiveSet_NonCompliantSelf_NoRecoverTestEnforceEpochEnd_NonEmptyActiveSet_NoPeerObs_NoRecoverTestEnforceEpochEnd_NoEpochAnchor_FallsThroughToLegacyPathThe pre-existing test
TestEnforceEpochEnd_EmptyActiveSet_PostponedCannotRecoverthat asserted the deadlock is inverted to the new contract (Times(2)recovery instead ofTimes(0)).Risk
LOW. The bootstrap branch is reachable only when:
In any normal operating state (≥1 ACTIVE), the branch is skipped and the legacy peer-port path runs unchanged. Determinism preserved.
Rollback
git revert <commit>. The legacy peer-port deadlock returns; recovery still possible via the documentedderegister + re-registerprocedure (internal runbook:lumera-supernode-postponed-recovery).Migration / upgrade notes
None. Pure logic change in
EnforceEpochEnd's recovery branch. No state-key changes, no proto changes, no consensus version bump. Safe to ship as a minor-version logic patch (e.g. v1.20.x). Activates the moment the upgraded binary runs — no proposal-side toggle required.IBC / interop
N/A — audit module is not IBC-bound.
Devnet rehearsal
Unit-level test matrix covers the bootstrap branch + the legacy-preservation cases with mocked supernode keeper. Live devnet rehearsal (force all SNs POSTPONED, wait for next epoch end, verify auto-recovery without manual deregister/register) is planned as part of the v1.20 upgrade-rehearsal gate.
Follow-ups (not in this PR)
RequiredOpenPortsvia a sentinel value. Currently blocked byParams.WithDefaultssilently re-filling the list. With PR-A landed, this is less urgent.ConsecutiveEpochsToPostponedefault from 1 to 3 (or enforce a floor inParams.Validate). Reduces the probability of triggering the empty-set condition. Can be folded into the v1.12.0 upgrade handler (proposal not yet on chain) to apply at activation without a separate gov proposal.Files changed
x/audit/v1/keeper/enforcement.go— +11 lines, bootstrap branchx/audit/v1/keeper/enforcement_empty_active_set_test.go— test matrix