fix(x/audit): bootstrap-recovery exception when active set is empty by mateeullahmalik · Pull Request #135 · LumeraProtocol/lumera

mateeullahmalik · 2026-05-11T08:58:35Z

Summary

Fixes the empty active set deadlock in the audit module: when all supernodes are simultaneously POSTPONED, the existing peer-port recovery rule in shouldRecoverAtEpochEnd becomes unsatisfiable by construction (zero probers ⇒ zero peer reports ⇒ no SN can recover). The chain cannot self-heal and recovery requires every validator key holder to perform a manual deregister + re-register cycle out-of-band — a distributed coordination problem on mainnet.

This PR adds a bootstrap-recovery exception: when the epoch's anchored active set is empty AND the POSTPONED SN submits a compliant self host-report, the SN recovers to ACTIVE.

Why this matters (live mainnet evidence)

Queried lumera.audit.v1.Query/Params at mainnet height 5,001,129:

required_open_ports:            [4444, 4445, 8002]
consecutive_epochs_to_postpone: 1     ← one missed epoch postpones
peer_port_postpone_threshold_percent: 100
epoch_length_blocks: 400              (~40 min epochs)

Current state: 15 ACTIVE / 11 POSTPONED / 2 DISABLED — thin active margin.

Realistic upgrade-day trigger: an upgrade-halt window plus rolling restart lag of >40 min for a subset of SNs ⇒ those SNs miss one epoch ⇒ POSTPONED in lockstep. If the surviving active set drops to zero, the deadlock locks in permanently.

Two governance escape hatches that LOOK viable but DON'T WORK were verified on devnet 2026-05-08 (gov proposal 33 PASSED on-chain but silently no-op'd):

MsgUpdateParams with required_open_ports=[] is silently re-filled to defaults by Params.WithDefaults() (x/audit/v1/types/params.go:406-407)
Even if step 1 were bypassed, Params.Validate() line 619 rejects empty list

The deadlock is a real protocol-level design gap, not a devnet quirk.

What changed

x/audit/v1/keeper/enforcement.go::shouldRecoverAtEpochEnd — adds an 11-line branch that fires when GetEpochAnchor(epochID).ActiveSupernodeAccounts is empty, after the existing storage-truth / action-finalization redirects and after selfHostCompliant. A misbehaving SN cannot self-recover via this branch — self-compliance still gates it.

Invariants preserved

Invariant	Status
Determinism	✓ Reads existing deterministic state (`EpochAnchor`) frozen at epoch start
Cosmos replay-safety	✓ No new external calls, no wall-clock, no map iteration
Storage-truth recovery semantics	✓ Redirect to `shouldRecoverFromStorageTruthPostponement` runs first, bootstrap branch NOT taken
Action-finalization recovery semantics	✓ Redirect to `shouldRecoverFromActionFinalizationPostponement` runs first, bootstrap branch NOT taken
Self-compliance gate	✓ Still mandatory before the bootstrap branch
Legacy peer-port recovery (non-empty anchor)	✓ Unchanged when anchor has ≥1 active account

Test matrix (5 cells, all PASS)

x/audit/v1/keeper/enforcement_empty_active_set_test.go:

Anchor active count	Self-report	Self-compliant	Peer all-open	Expected	Test
0	yes	yes	n/a	recover ✓	`TestEnforceEpochEnd_EmptyActiveSet_PostponedRecoversViaBootstrapException`
0	no	n/a	n/a	NO recover	`TestEnforceEpochEnd_EmptyActiveSet_NoSelfReport_NoRecover`
0	yes	no (disk > min)	n/a	NO recover	`TestEnforceEpochEnd_EmptyActiveSet_NonCompliantSelf_NoRecover`
≥1	yes	yes	no	NO recover (legacy)	`TestEnforceEpochEnd_NonEmptyActiveSet_NoPeerObs_NoRecover`
no anchor	yes	yes	n/a	NO recover (legacy)	`TestEnforceEpochEnd_NoEpochAnchor_FallsThroughToLegacyPath`

The pre-existing test TestEnforceEpochEnd_EmptyActiveSet_PostponedCannotRecover that asserted the deadlock is inverted to the new contract (Times(2) recovery instead of Times(0)).

Risk

LOW. The bootstrap branch is reachable only when:

No storage-truth postponement at this epoch, AND
No action-finalization postponement at this epoch, AND
SN submitted a compliant self host-report, AND
Epoch anchor exists AND active set is empty

In any normal operating state (≥1 ACTIVE), the branch is skipped and the legacy peer-port path runs unchanged. Determinism preserved.

Rollback

git revert <commit>. The legacy peer-port deadlock returns; recovery still possible via the documented deregister + re-register procedure (internal runbook: lumera-supernode-postponed-recovery).

Migration / upgrade notes

None. Pure logic change in EnforceEpochEnd's recovery branch. No state-key changes, no proto changes, no consensus version bump. Safe to ship as a minor-version logic patch (e.g. v1.20.x). Activates the moment the upgraded binary runs — no proposal-side toggle required.

IBC / interop

N/A — audit module is not IBC-bound.

Devnet rehearsal

Unit-level test matrix covers the bootstrap branch + the legacy-preservation cases with mocked supernode keeper. Live devnet rehearsal (force all SNs POSTPONED, wait for next epoch end, verify auto-recovery without manual deregister/register) is planned as part of the v1.20 upgrade-rehearsal gate.

Follow-ups (not in this PR)

PR-B (deferred): allow governance to clear RequiredOpenPorts via a sentinel value. Currently blocked by Params.WithDefaults silently re-filling the list. With PR-A landed, this is less urgent.
PR-C (deferred): bump ConsecutiveEpochsToPostpone default from 1 to 3 (or enforce a floor in Params.Validate). Reduces the probability of triggering the empty-set condition. Can be folded into the v1.12.0 upgrade handler (proposal not yet on chain) to apply at activation without a separate gov proposal.

Files changed

x/audit/v1/keeper/enforcement.go — +11 lines, bootstrap branch
x/audit/v1/keeper/enforcement_empty_active_set_test.go — test matrix

When the epoch's anchored active set is empty (all supernodes POSTPONED), the peer-port recovery rule in shouldRecoverAtEpochEnd becomes unsatisfiable by construction: with zero probers, no peer report exists that could attest all-ports-OPEN for any POSTPONED supernode. The chain cannot self-heal — every validator key holder must perform a manual deregister+re-register cycle out-of-band. Trigger on mainnet (live params at height 5,001,129): - consecutive_epochs_to_postpone = 1 (one missed epoch postpones) - 15 ACTIVE / 11 POSTPONED / 2 DISABLED — thin active margin - Upgrade halts >= 1 epoch (~40 min) → SNs that lag postpone in lockstep → active set can drop to 0 → permanent deadlock. Fix: when GetEpochAnchor(epochID).ActiveSupernodeAccounts is empty, accept a compliant self host-report alone as sufficient for recovery. The bootstrap exception sits AFTER the storage-truth and action-finalization redirects (they keep their own recovery semantics) and AFTER selfHostCompliant (a misbehaving SN cannot self-recover via this branch). When no anchor exists for the epoch (test fixture or pre-anchor edge case), the branch is skipped and the legacy peer-port path runs unchanged. Test matrix (5 cells): - empty anchor + compliant self-report → recover - empty anchor + no self-report → no-recover (self-gate) - empty anchor + non-compliant self-report → no-recover (self-gate) - non-empty anchor + no peer obs → no-recover (legacy preserved) - no anchor → no-recover (legacy preserved) The pre-fix scenario test that asserted deadlock (TestEnforceEpochEnd_EmptyActiveSet_PostponedCannotRecover) is inverted to its new contract: recovery succeeds via the bootstrap exception when self-reports are compliant. Risk: LOW. Reads existing deterministic state (EpochAnchor) only. Branch is only reachable when no other recovery path applies and self-compliance has already been verified. No new external calls, no wall-clock dependency, no map iteration. Cosmos determinism preserved. Rollback: revert this commit. The legacy peer-port deadlock returns, recoverable via the documented deregister+re-register procedure (skill: lumera-supernode-postponed-recovery). Refs: 2026-05-08 devnet incident where all 5 SNs went POSTPONED; gov proposal 33 to bypass via empty required_open_ports passed on-chain but silently no-op'd because Params.WithDefaults() re-fills the list. The deadlock is a real protocol-level design gap, not a devnet quirk.

Two system tests in audit_empty_active_set_bootstrap_test.go were written to document the empty-active-set DEADLOCK as expected behavior (one used legacy MsgReportSupernodeMetrics to break it, the other asserted 3 consecutive host-only-report epochs never recover). With the bootstrap-recovery exception in shouldRecoverAtEpochEnd (this PR's main change), the deadlock no longer exists: compliant self host-reports alone are sufficient to recover when the active set is empty. Invert both tests to the new contract: 1. TestAuditEmptyActiveSetBootstrap_HostOnlyReportsRecover (was: TestAuditEmptyActiveSetBootstrap_LegacyMetricsBreaksDeadlock + TestAuditEmptyActiveSetDeadlock_HostOnlyReportsCannotRecover) Asserts both POSTPONED SNs recover to ACTIVE at epoch 1 end after submitting compliant host-only reports — no legacy metrics path needed. The chain self-heals. 2. TestAuditEmptyActiveSetBootstrap_NonCompliantHostStaysPostponed (NEW) Guards the self-compliance gate. With MinDiskFreePercent=20, a POSTPONED SN that reports DiskUsagePercent=95 (5% free) MUST remain POSTPONED even though the active set is empty. This blocks the exception from becoming a 'free pass' for misbehaving SNs and complements the unit-level violation tests in x/audit/v1/keeper/enforcement_empty_active_set_test.go. Helpers added in audit_test_helpers_test.go: - auditHostReportWithDiskUsageJSON: lets a test pin DiskUsagePercent. - setAuditParamsForFastEpochsWithMinDiskFree: lets a test override MinDiskFreePercent in genesis. Found by: PR #135 CI system-test failure (the previous tests asserted the pre-fix deadlock contract). The original assertions are now covered by historical context in commit messages and the skill lumera-supernode-postponed-recovery.

Master added a full-scan golangci-lint hard gate in lint.yml (PR #135 era). That supersedes this PR's original premise of being the *only* lint gate. Reworked into an additive PR-only layer: - Drop the lint.yml/.golangci.yml overrides; master's full-scan stays the authoritative correctness gate. - Add .github/workflows/lint-pr.yml carrying: * reviewdog-wrapped diff-only golangci-lint (filter_mode: added) so findings inside the PR diff land as inline review comments rather than a single check-run log scrape. * govulncheck job as a separate security gate. Copilot review applied to the new file: - No `level: warning` next to `fail_level: error` — the two together silently downgrade every finding and bypass the gate. Drop `level` so each diagnostic's native severity is preserved. - Split reviewdog step on `head.repo.full_name == github.repository`: same-repo PRs use `github-pr-review` (inline comments), fork PRs use `reporter: local` (log-only) because fork PR tokens are strictly read-only and neither pr-review nor pr-check can post results. - govulncheck-action: keep `repo-checkout: false` from the previous commit (duplicate-Authorization-header fix).

…135) (#137) * fix(x/audit): bootstrap-recovery exception when active set is empty When the epoch's anchored active set is empty (all supernodes POSTPONED), the peer-port recovery rule in shouldRecoverAtEpochEnd becomes unsatisfiable by construction: with zero probers, no peer report exists that could attest all-ports-OPEN for any POSTPONED supernode. The chain cannot self-heal — every validator key holder must perform a manual deregister+re-register cycle out-of-band. Trigger on mainnet (live params at height 5,001,129): - consecutive_epochs_to_postpone = 1 (one missed epoch postpones) - 15 ACTIVE / 11 POSTPONED / 2 DISABLED — thin active margin - Upgrade halts >= 1 epoch (~40 min) → SNs that lag postpone in lockstep → active set can drop to 0 → permanent deadlock. Fix: when GetEpochAnchor(epochID).ActiveSupernodeAccounts is empty, accept a compliant self host-report alone as sufficient for recovery. The bootstrap exception sits AFTER the storage-truth and action-finalization redirects (they keep their own recovery semantics) and AFTER selfHostCompliant (a misbehaving SN cannot self-recover via this branch). When no anchor exists for the epoch (test fixture or pre-anchor edge case), the branch is skipped and the legacy peer-port path runs unchanged. Test matrix (5 cells): - empty anchor + compliant self-report → recover - empty anchor + no self-report → no-recover (self-gate) - empty anchor + non-compliant self-report → no-recover (self-gate) - non-empty anchor + no peer obs → no-recover (legacy preserved) - no anchor → no-recover (legacy preserved) The pre-fix scenario test that asserted deadlock (TestEnforceEpochEnd_EmptyActiveSet_PostponedCannotRecover) is inverted to its new contract: recovery succeeds via the bootstrap exception when self-reports are compliant. Risk: LOW. Reads existing deterministic state (EpochAnchor) only. Branch is only reachable when no other recovery path applies and self-compliance has already been verified. No new external calls, no wall-clock dependency, no map iteration. Cosmos determinism preserved. Rollback: revert this commit. The legacy peer-port deadlock returns, recoverable via the documented deregister+re-register procedure (skill: lumera-supernode-postponed-recovery). Refs: 2026-05-08 devnet incident where all 5 SNs went POSTPONED; gov proposal 33 to bypass via empty required_open_ports passed on-chain but silently no-op'd because Params.WithDefaults() re-fills the list. The deadlock is a real protocol-level design gap, not a devnet quirk. * test(systemtests): invert empty-active-set tests for bootstrap exception Two system tests in audit_empty_active_set_bootstrap_test.go were written to document the empty-active-set DEADLOCK as expected behavior (one used legacy MsgReportSupernodeMetrics to break it, the other asserted 3 consecutive host-only-report epochs never recover). With the bootstrap-recovery exception in shouldRecoverAtEpochEnd (this PR's main change), the deadlock no longer exists: compliant self host-reports alone are sufficient to recover when the active set is empty. Invert both tests to the new contract: 1. TestAuditEmptyActiveSetBootstrap_HostOnlyReportsRecover (was: TestAuditEmptyActiveSetBootstrap_LegacyMetricsBreaksDeadlock + TestAuditEmptyActiveSetDeadlock_HostOnlyReportsCannotRecover) Asserts both POSTPONED SNs recover to ACTIVE at epoch 1 end after submitting compliant host-only reports — no legacy metrics path needed. The chain self-heals. 2. TestAuditEmptyActiveSetBootstrap_NonCompliantHostStaysPostponed (NEW) Guards the self-compliance gate. With MinDiskFreePercent=20, a POSTPONED SN that reports DiskUsagePercent=95 (5% free) MUST remain POSTPONED even though the active set is empty. This blocks the exception from becoming a 'free pass' for misbehaving SNs and complements the unit-level violation tests in x/audit/v1/keeper/enforcement_empty_active_set_test.go. Helpers added in audit_test_helpers_test.go: - auditHostReportWithDiskUsageJSON: lets a test pin DiskUsagePercent. - setAuditParamsForFastEpochsWithMinDiskFree: lets a test override MinDiskFreePercent in genesis. Found by: PR #135 CI system-test failure (the previous tests asserted the pre-fix deadlock contract). The original assertions are now covered by historical context in commit messages and the skill lumera-supernode-postponed-recovery. (cherry picked from commit d0e181d)

mateeullahmalik added 2 commits May 11, 2026 08:57

mateeullahmalik self-assigned this May 11, 2026

j-rafique approved these changes May 11, 2026

View reviewed changes

mateeullahmalik merged commit d0e181d into master May 11, 2026
18 checks passed

mateeullahmalik mentioned this pull request May 11, 2026

[1.12.0] cherry-pick #135: fix(x/audit) bootstrap-recovery exception when active set is empty #137

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(x/audit): bootstrap-recovery exception when active set is empty#135

fix(x/audit): bootstrap-recovery exception when active set is empty#135
mateeullahmalik merged 2 commits into
masterfrom
fix/audit-bootstrap-recovery

mateeullahmalik commented May 11, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

mateeullahmalik commented May 11, 2026

Summary

Why this matters (live mainnet evidence)

What changed

Invariants preserved

Test matrix (5 cells, all PASS)

Risk

Rollback

Migration / upgrade notes

IBC / interop

Devnet rehearsal

Follow-ups (not in this PR)

Files changed

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants