Skip to content

fix(x/audit): bootstrap-recovery exception when active set is empty#135

Merged
mateeullahmalik merged 2 commits into
masterfrom
fix/audit-bootstrap-recovery
May 11, 2026
Merged

fix(x/audit): bootstrap-recovery exception when active set is empty#135
mateeullahmalik merged 2 commits into
masterfrom
fix/audit-bootstrap-recovery

Conversation

@mateeullahmalik
Copy link
Copy Markdown
Contributor

Summary

Fixes the empty active set deadlock in the audit module: when all supernodes are simultaneously POSTPONED, the existing peer-port recovery rule in shouldRecoverAtEpochEnd becomes unsatisfiable by construction (zero probers ⇒ zero peer reports ⇒ no SN can recover). The chain cannot self-heal and recovery requires every validator key holder to perform a manual deregister + re-register cycle out-of-band — a distributed coordination problem on mainnet.

This PR adds a bootstrap-recovery exception: when the epoch's anchored active set is empty AND the POSTPONED SN submits a compliant self host-report, the SN recovers to ACTIVE.

Why this matters (live mainnet evidence)

Queried lumera.audit.v1.Query/Params at mainnet height 5,001,129:

required_open_ports:            [4444, 4445, 8002]
consecutive_epochs_to_postpone: 1     ← one missed epoch postpones
peer_port_postpone_threshold_percent: 100
epoch_length_blocks: 400              (~40 min epochs)

Current state: 15 ACTIVE / 11 POSTPONED / 2 DISABLED — thin active margin.

Realistic upgrade-day trigger: an upgrade-halt window plus rolling restart lag of >40 min for a subset of SNs ⇒ those SNs miss one epoch ⇒ POSTPONED in lockstep. If the surviving active set drops to zero, the deadlock locks in permanently.

Two governance escape hatches that LOOK viable but DON'T WORK were verified on devnet 2026-05-08 (gov proposal 33 PASSED on-chain but silently no-op'd):

  • MsgUpdateParams with required_open_ports=[] is silently re-filled to defaults by Params.WithDefaults() (x/audit/v1/types/params.go:406-407)
  • Even if step 1 were bypassed, Params.Validate() line 619 rejects empty list

The deadlock is a real protocol-level design gap, not a devnet quirk.

What changed

x/audit/v1/keeper/enforcement.go::shouldRecoverAtEpochEnd — adds an 11-line branch that fires when GetEpochAnchor(epochID).ActiveSupernodeAccounts is empty, after the existing storage-truth / action-finalization redirects and after selfHostCompliant. A misbehaving SN cannot self-recover via this branch — self-compliance still gates it.

Invariants preserved

Invariant Status
Determinism ✓ Reads existing deterministic state (EpochAnchor) frozen at epoch start
Cosmos replay-safety ✓ No new external calls, no wall-clock, no map iteration
Storage-truth recovery semantics ✓ Redirect to shouldRecoverFromStorageTruthPostponement runs first, bootstrap branch NOT taken
Action-finalization recovery semantics ✓ Redirect to shouldRecoverFromActionFinalizationPostponement runs first, bootstrap branch NOT taken
Self-compliance gate ✓ Still mandatory before the bootstrap branch
Legacy peer-port recovery (non-empty anchor) ✓ Unchanged when anchor has ≥1 active account

Test matrix (5 cells, all PASS)

x/audit/v1/keeper/enforcement_empty_active_set_test.go:

Anchor active count Self-report Self-compliant Peer all-open Expected Test
0 yes yes n/a recover ✓ TestEnforceEpochEnd_EmptyActiveSet_PostponedRecoversViaBootstrapException
0 no n/a n/a NO recover TestEnforceEpochEnd_EmptyActiveSet_NoSelfReport_NoRecover
0 yes no (disk > min) n/a NO recover TestEnforceEpochEnd_EmptyActiveSet_NonCompliantSelf_NoRecover
≥1 yes yes no NO recover (legacy) TestEnforceEpochEnd_NonEmptyActiveSet_NoPeerObs_NoRecover
no anchor yes yes n/a NO recover (legacy) TestEnforceEpochEnd_NoEpochAnchor_FallsThroughToLegacyPath

The pre-existing test TestEnforceEpochEnd_EmptyActiveSet_PostponedCannotRecover that asserted the deadlock is inverted to the new contract (Times(2) recovery instead of Times(0)).

Risk

LOW. The bootstrap branch is reachable only when:

  1. No storage-truth postponement at this epoch, AND
  2. No action-finalization postponement at this epoch, AND
  3. SN submitted a compliant self host-report, AND
  4. Epoch anchor exists AND active set is empty

In any normal operating state (≥1 ACTIVE), the branch is skipped and the legacy peer-port path runs unchanged. Determinism preserved.

Rollback

git revert <commit>. The legacy peer-port deadlock returns; recovery still possible via the documented deregister + re-register procedure (internal runbook: lumera-supernode-postponed-recovery).

Migration / upgrade notes

None. Pure logic change in EnforceEpochEnd's recovery branch. No state-key changes, no proto changes, no consensus version bump. Safe to ship as a minor-version logic patch (e.g. v1.20.x). Activates the moment the upgraded binary runs — no proposal-side toggle required.

IBC / interop

N/A — audit module is not IBC-bound.

Devnet rehearsal

Unit-level test matrix covers the bootstrap branch + the legacy-preservation cases with mocked supernode keeper. Live devnet rehearsal (force all SNs POSTPONED, wait for next epoch end, verify auto-recovery without manual deregister/register) is planned as part of the v1.20 upgrade-rehearsal gate.

Follow-ups (not in this PR)

  • PR-B (deferred): allow governance to clear RequiredOpenPorts via a sentinel value. Currently blocked by Params.WithDefaults silently re-filling the list. With PR-A landed, this is less urgent.
  • PR-C (deferred): bump ConsecutiveEpochsToPostpone default from 1 to 3 (or enforce a floor in Params.Validate). Reduces the probability of triggering the empty-set condition. Can be folded into the v1.12.0 upgrade handler (proposal not yet on chain) to apply at activation without a separate gov proposal.

Files changed

  • x/audit/v1/keeper/enforcement.go — +11 lines, bootstrap branch
  • x/audit/v1/keeper/enforcement_empty_active_set_test.go — test matrix

When the epoch's anchored active set is empty (all supernodes
POSTPONED), the peer-port recovery rule in shouldRecoverAtEpochEnd
becomes unsatisfiable by construction: with zero probers, no peer
report exists that could attest all-ports-OPEN for any POSTPONED
supernode. The chain cannot self-heal — every validator key holder
must perform a manual deregister+re-register cycle out-of-band.

Trigger on mainnet (live params at height 5,001,129):
- consecutive_epochs_to_postpone = 1 (one missed epoch postpones)
- 15 ACTIVE / 11 POSTPONED / 2 DISABLED — thin active margin
- Upgrade halts >= 1 epoch (~40 min) → SNs that lag postpone in
  lockstep → active set can drop to 0 → permanent deadlock.

Fix: when GetEpochAnchor(epochID).ActiveSupernodeAccounts is empty,
accept a compliant self host-report alone as sufficient for recovery.

The bootstrap exception sits AFTER the storage-truth and
action-finalization redirects (they keep their own recovery
semantics) and AFTER selfHostCompliant (a misbehaving SN cannot
self-recover via this branch). When no anchor exists for the epoch
(test fixture or pre-anchor edge case), the branch is skipped and
the legacy peer-port path runs unchanged.

Test matrix (5 cells):
- empty anchor + compliant self-report → recover
- empty anchor + no self-report → no-recover (self-gate)
- empty anchor + non-compliant self-report → no-recover (self-gate)
- non-empty anchor + no peer obs → no-recover (legacy preserved)
- no anchor → no-recover (legacy preserved)

The pre-fix scenario test that asserted deadlock
(TestEnforceEpochEnd_EmptyActiveSet_PostponedCannotRecover) is
inverted to its new contract: recovery succeeds via the bootstrap
exception when self-reports are compliant.

Risk: LOW. Reads existing deterministic state (EpochAnchor) only.
Branch is only reachable when no other recovery path applies and
self-compliance has already been verified. No new external calls,
no wall-clock dependency, no map iteration. Cosmos determinism
preserved.

Rollback: revert this commit. The legacy peer-port deadlock returns,
recoverable via the documented deregister+re-register procedure
(skill: lumera-supernode-postponed-recovery).

Refs: 2026-05-08 devnet incident where all 5 SNs went POSTPONED;
gov proposal 33 to bypass via empty required_open_ports passed
on-chain but silently no-op'd because Params.WithDefaults() re-fills
the list. The deadlock is a real protocol-level design gap, not a
devnet quirk.
Two system tests in audit_empty_active_set_bootstrap_test.go were
written to document the empty-active-set DEADLOCK as expected
behavior (one used legacy MsgReportSupernodeMetrics to break it, the
other asserted 3 consecutive host-only-report epochs never recover).

With the bootstrap-recovery exception in shouldRecoverAtEpochEnd
(this PR's main change), the deadlock no longer exists: compliant
self host-reports alone are sufficient to recover when the active
set is empty.

Invert both tests to the new contract:

1. TestAuditEmptyActiveSetBootstrap_HostOnlyReportsRecover
   (was: TestAuditEmptyActiveSetBootstrap_LegacyMetricsBreaksDeadlock
    + TestAuditEmptyActiveSetDeadlock_HostOnlyReportsCannotRecover)

   Asserts both POSTPONED SNs recover to ACTIVE at epoch 1 end after
   submitting compliant host-only reports — no legacy metrics path
   needed. The chain self-heals.

2. TestAuditEmptyActiveSetBootstrap_NonCompliantHostStaysPostponed (NEW)

   Guards the self-compliance gate. With MinDiskFreePercent=20, a
   POSTPONED SN that reports DiskUsagePercent=95 (5% free) MUST
   remain POSTPONED even though the active set is empty. This blocks
   the exception from becoming a 'free pass' for misbehaving SNs and
   complements the unit-level violation tests in
   x/audit/v1/keeper/enforcement_empty_active_set_test.go.

Helpers added in audit_test_helpers_test.go:
- auditHostReportWithDiskUsageJSON: lets a test pin DiskUsagePercent.
- setAuditParamsForFastEpochsWithMinDiskFree: lets a test override
  MinDiskFreePercent in genesis.

Found by: PR #135 CI system-test failure (the previous tests asserted
the pre-fix deadlock contract). The original assertions are now
covered by historical context in commit messages and the skill
lumera-supernode-postponed-recovery.
@mateeullahmalik mateeullahmalik self-assigned this May 11, 2026
@mateeullahmalik mateeullahmalik merged commit d0e181d into master May 11, 2026
18 checks passed
mateeullahmalik added a commit that referenced this pull request May 11, 2026
Master added a full-scan golangci-lint hard gate in lint.yml (PR #135 era).
That supersedes this PR's original premise of being the *only* lint gate.

Reworked into an additive PR-only layer:
- Drop the lint.yml/.golangci.yml overrides; master's full-scan stays the
  authoritative correctness gate.
- Add .github/workflows/lint-pr.yml carrying:
  * reviewdog-wrapped diff-only golangci-lint (filter_mode: added) so
    findings inside the PR diff land as inline review comments rather
    than a single check-run log scrape.
  * govulncheck job as a separate security gate.

Copilot review applied to the new file:
- No `level: warning` next to `fail_level: error` — the two together
  silently downgrade every finding and bypass the gate. Drop `level`
  so each diagnostic's native severity is preserved.
- Split reviewdog step on `head.repo.full_name == github.repository`:
  same-repo PRs use `github-pr-review` (inline comments), fork PRs use
  `reporter: local` (log-only) because fork PR tokens are strictly
  read-only and neither pr-review nor pr-check can post results.
- govulncheck-action: keep `repo-checkout: false` from the previous
  commit (duplicate-Authorization-header fix).
mateeullahmalik added a commit that referenced this pull request May 11, 2026
…135) (#137)

* fix(x/audit): bootstrap-recovery exception when active set is empty

When the epoch's anchored active set is empty (all supernodes
POSTPONED), the peer-port recovery rule in shouldRecoverAtEpochEnd
becomes unsatisfiable by construction: with zero probers, no peer
report exists that could attest all-ports-OPEN for any POSTPONED
supernode. The chain cannot self-heal — every validator key holder
must perform a manual deregister+re-register cycle out-of-band.

Trigger on mainnet (live params at height 5,001,129):
- consecutive_epochs_to_postpone = 1 (one missed epoch postpones)
- 15 ACTIVE / 11 POSTPONED / 2 DISABLED — thin active margin
- Upgrade halts >= 1 epoch (~40 min) → SNs that lag postpone in
  lockstep → active set can drop to 0 → permanent deadlock.

Fix: when GetEpochAnchor(epochID).ActiveSupernodeAccounts is empty,
accept a compliant self host-report alone as sufficient for recovery.

The bootstrap exception sits AFTER the storage-truth and
action-finalization redirects (they keep their own recovery
semantics) and AFTER selfHostCompliant (a misbehaving SN cannot
self-recover via this branch). When no anchor exists for the epoch
(test fixture or pre-anchor edge case), the branch is skipped and
the legacy peer-port path runs unchanged.

Test matrix (5 cells):
- empty anchor + compliant self-report → recover
- empty anchor + no self-report → no-recover (self-gate)
- empty anchor + non-compliant self-report → no-recover (self-gate)
- non-empty anchor + no peer obs → no-recover (legacy preserved)
- no anchor → no-recover (legacy preserved)

The pre-fix scenario test that asserted deadlock
(TestEnforceEpochEnd_EmptyActiveSet_PostponedCannotRecover) is
inverted to its new contract: recovery succeeds via the bootstrap
exception when self-reports are compliant.

Risk: LOW. Reads existing deterministic state (EpochAnchor) only.
Branch is only reachable when no other recovery path applies and
self-compliance has already been verified. No new external calls,
no wall-clock dependency, no map iteration. Cosmos determinism
preserved.

Rollback: revert this commit. The legacy peer-port deadlock returns,
recoverable via the documented deregister+re-register procedure
(skill: lumera-supernode-postponed-recovery).

Refs: 2026-05-08 devnet incident where all 5 SNs went POSTPONED;
gov proposal 33 to bypass via empty required_open_ports passed
on-chain but silently no-op'd because Params.WithDefaults() re-fills
the list. The deadlock is a real protocol-level design gap, not a
devnet quirk.

* test(systemtests): invert empty-active-set tests for bootstrap exception

Two system tests in audit_empty_active_set_bootstrap_test.go were
written to document the empty-active-set DEADLOCK as expected
behavior (one used legacy MsgReportSupernodeMetrics to break it, the
other asserted 3 consecutive host-only-report epochs never recover).

With the bootstrap-recovery exception in shouldRecoverAtEpochEnd
(this PR's main change), the deadlock no longer exists: compliant
self host-reports alone are sufficient to recover when the active
set is empty.

Invert both tests to the new contract:

1. TestAuditEmptyActiveSetBootstrap_HostOnlyReportsRecover
   (was: TestAuditEmptyActiveSetBootstrap_LegacyMetricsBreaksDeadlock
    + TestAuditEmptyActiveSetDeadlock_HostOnlyReportsCannotRecover)

   Asserts both POSTPONED SNs recover to ACTIVE at epoch 1 end after
   submitting compliant host-only reports — no legacy metrics path
   needed. The chain self-heals.

2. TestAuditEmptyActiveSetBootstrap_NonCompliantHostStaysPostponed (NEW)

   Guards the self-compliance gate. With MinDiskFreePercent=20, a
   POSTPONED SN that reports DiskUsagePercent=95 (5% free) MUST
   remain POSTPONED even though the active set is empty. This blocks
   the exception from becoming a 'free pass' for misbehaving SNs and
   complements the unit-level violation tests in
   x/audit/v1/keeper/enforcement_empty_active_set_test.go.

Helpers added in audit_test_helpers_test.go:
- auditHostReportWithDiskUsageJSON: lets a test pin DiskUsagePercent.
- setAuditParamsForFastEpochsWithMinDiskFree: lets a test override
  MinDiskFreePercent in genesis.

Found by: PR #135 CI system-test failure (the previous tests asserted
the pre-fix deadlock contract). The original assertions are now
covered by historical context in commit messages and the skill
lumera-supernode-postponed-recovery.

(cherry picked from commit d0e181d)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants