
[1.12.0] cherry-pick #135: fix(x/audit) bootstrap-recovery exception when active set is empty #137

Merged: mateeullahmalik merged 1 commit into 1.12.0 from cherry-pick/audit-bootstrap-recovery-to-1.12.0 on May 11, 2026

@mateeullahmalik (Contributor)

Cherry-picks PR #135 (fix(x/audit): bootstrap-recovery exception when active set is empty) onto the 1.12.0 release branch.

What this is

Clean git cherry-pick -x of squash-merge commit d0e181d from master. No conflicts. No code modifications during the cherry-pick — same fix, same tests.

Why this needs to be in v1.12.0

The empty-active-set deadlock is a real protocol design gap that triggers when all supernodes are simultaneously POSTPONED at an epoch boundary. Once in this state, the chain cannot self-heal via gov or any chain-level mechanism — only an out-of-band coordinated deregister + re-register cycle by every postponed validator key holder can recover it.

Why v1.12.0 specifically is the right place:

Live mainnet params (queried at height 5,001,129):

consecutive_epochs_to_postpone: 1        ← single missed epoch postpones
required_open_ports:            [4444, 4445, 8002]
epoch_length_blocks:            400      (~40 min)

Current mainnet supernode breakdown: 15 ACTIVE / 11 POSTPONED / 2 DISABLED. Thin active margin.

Upgrade-day trigger: the chain halts at the upgrade height, and a subset of SN operators roll forward late (more than ~40 min, i.e. one epoch). Those SNs miss one epoch report and go POSTPONED in lockstep. If the surviving ACTIVE set drops to zero (plausible with only 15 of 28 supernodes currently ACTIVE), the deadlock persists until every postponed validator key holder completes the manual deregister + re-register cycle.

v1.12.0 is the next upgrade after the long-running v1.11.x line. This fix MUST land in the binary that nodes upgrade INTO so the safety net is in place from block 1 of v1.12.0.

What changed

11-line fix in x/audit/v1/keeper/enforcement.go::shouldRecoverAtEpochEnd: when the epoch's anchored active set is empty, accept a compliant self host-report alone as sufficient for recovery. Sits AFTER the storage-truth and action-finalization redirects, AFTER selfHostCompliant — a misbehaving SN cannot self-recover via this branch.

Tests cover the full matrix (unit + systemtest) — see PR #135 for the table.
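
For reviewers who want the ordering at a glance, here is a minimal sketch of the decision order described above. It is not the code in enforcement.go: the `recoveryInput` struct and its field names are hypothetical, the storage-truth and action-finalization redirects are left unmodeled, and the legacy peer-port rule is collapsed into a single boolean.

```go
package main

import "fmt"

// recoveryInput is a hypothetical, flattened view of the facts the keeper
// consults. The real function reads these from chain state.
type recoveryInput struct {
	selfHostCompliant bool     // a self host-report exists and passes the checks
	anchorFound       bool     // an epoch anchor exists for this epoch
	activeAtAnchor    []string // anchored active supernode accounts
	peerPortsAllOpen  bool     // legacy rule, collapsed to one flag (assumption)
}

// shouldRecover mirrors the ordering from the PR description only.
func shouldRecover(in recoveryInput) bool {
	// The storage-truth and action-finalization redirects would run before
	// this point; they keep their own semantics and are not modeled here.
	if !in.selfHostCompliant {
		return false // self gate: a misbehaving SN never reaches the exception
	}
	if in.anchorFound && len(in.activeAtAnchor) == 0 {
		return true // bootstrap exception: no prober exists, self-report suffices
	}
	return in.peerPortsAllOpen // legacy peer-port path, unchanged
}

func main() {
	cases := []struct {
		name string
		in   recoveryInput
	}{
		{"empty anchor + compliant self-report", recoveryInput{selfHostCompliant: true, anchorFound: true}},
		{"empty anchor + failed self gate", recoveryInput{anchorFound: true}},
		{"non-empty anchor + no peer obs", recoveryInput{selfHostCompliant: true, anchorFound: true, activeAtAnchor: []string{"sn1"}}},
		{"no anchor", recoveryInput{selfHostCompliant: true}},
	}
	for _, c := range cases {
		fmt.Printf("%-40s recover=%v\n", c.name, shouldRecover(c.in))
	}
}
```

Only the first case recovers. The remaining cases either fail the self gate or fall through to the legacy path, which is how the sketch reflects the "pure safety net" claim.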

Verification on 1.12.0 branch

  • go build ./x/audit/... clean
  • go test ./x/audit/v1/keeper/... -run "EmptyActiveSet|NoEpochAnchor|NonEmptyActiveSet" — PASS

Risk

LOW. Same risk profile as PR #135 (already reviewed and merged to master). No state-key changes, no proto changes, no consensus version bump. The branch is only reachable when no other recovery path applies, self-compliance has already been verified, and the active set is empty — a pure safety net. In any normal operating state (≥1 ACTIVE), the branch is skipped and the legacy peer-port path runs unchanged. Cosmos determinism preserved.

Rollback

git revert the cherry-pick commit. The legacy peer-port deadlock then becomes possible again, and recovery falls back to the manual deregister + re-register procedure in the documented runbook.

Refs

…135)

* fix(x/audit): bootstrap-recovery exception when active set is empty

When the epoch's anchored active set is empty (all supernodes
POSTPONED), the peer-port recovery rule in shouldRecoverAtEpochEnd
becomes unsatisfiable by construction: with zero probers, no peer
report exists that could attest all-ports-OPEN for any POSTPONED
supernode. The chain cannot self-heal — every validator key holder
must perform a manual deregister+re-register cycle out-of-band.

Trigger on mainnet (live params at height 5,001,129):
- consecutive_epochs_to_postpone = 1 (one missed epoch postpones)
- 15 ACTIVE / 11 POSTPONED / 2 DISABLED — thin active margin
- Upgrade halts >= 1 epoch (~40 min) → SNs that lag postpone in
  lockstep → active set can drop to 0 → permanent deadlock.

Fix: when GetEpochAnchor(epochID).ActiveSupernodeAccounts is empty,
accept a compliant self host-report alone as sufficient for recovery.

The bootstrap exception sits AFTER the storage-truth and
action-finalization redirects (they keep their own recovery
semantics) and AFTER selfHostCompliant (a misbehaving SN cannot
self-recover via this branch). When no anchor exists for the epoch
(test fixture or pre-anchor edge case), the branch is skipped and
the legacy peer-port path runs unchanged.

Test matrix (5 cells):
- empty anchor + compliant self-report → recover
- empty anchor + no self-report → no-recover (self-gate)
- empty anchor + non-compliant self-report → no-recover (self-gate)
- non-empty anchor + no peer obs → no-recover (legacy preserved)
- no anchor → no-recover (legacy preserved)

The pre-fix scenario test that asserted deadlock
(TestEnforceEpochEnd_EmptyActiveSet_PostponedCannotRecover) is
inverted to its new contract: recovery succeeds via the bootstrap
exception when self-reports are compliant.

Risk: LOW. Reads existing deterministic state (EpochAnchor) only.
Branch is only reachable when no other recovery path applies and
self-compliance has already been verified. No new external calls,
no wall-clock dependency, no map iteration. Cosmos determinism
preserved.

Rollback: revert this commit. The legacy peer-port deadlock returns,
recoverable via the documented deregister+re-register procedure
(skill: lumera-supernode-postponed-recovery).

Refs: 2026-05-08 devnet incident where all 5 SNs went POSTPONED;
gov proposal 33 to bypass via empty required_open_ports passed
on-chain but silently no-op'd because Params.WithDefaults() re-fills
the list. The deadlock is a real protocol-level design gap, not a
devnet quirk.
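
To illustrate why an empty required_open_ports can silently no-op, here is a sketch of the general "fill defaults on empty" pattern. The `Params` struct, the default values, and the body of `WithDefaults` are assumptions for illustration; only the method name and the re-fill behavior come from the note above.

```go
package main

import "fmt"

// Params is a hypothetical stand-in for the audit module params.
type Params struct {
	RequiredOpenPorts []uint32
}

// Assumed defaults, borrowed from the live values quoted in this PR.
var defaultRequiredOpenPorts = []uint32{4444, 4445, 8002}

// WithDefaults treats an empty list as "unset" and re-fills it. With this
// pattern, "explicitly empty" and "never configured" are indistinguishable.
func (p Params) WithDefaults() Params {
	if len(p.RequiredOpenPorts) == 0 {
		p.RequiredOpenPorts = append([]uint32(nil), defaultRequiredOpenPorts...)
	}
	return p
}

func main() {
	proposed := Params{RequiredOpenPorts: []uint32{}} // what the proposal intended
	effective := proposed.WithDefaults()              // what readers of params see
	fmt.Println(effective.RequiredOpenPorts)          // prints [4444 4445 8002]
}
```

Under this pattern a parameter change cannot express "no ports required", which is why the fix lives in the recovery rule instead.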

* test(systemtests): invert empty-active-set tests for bootstrap exception

Two system tests in audit_empty_active_set_bootstrap_test.go were
written to document the empty-active-set DEADLOCK as expected
behavior (one used legacy MsgReportSupernodeMetrics to break it, the
other asserted 3 consecutive host-only-report epochs never recover).

With the bootstrap-recovery exception in shouldRecoverAtEpochEnd
(this PR's main change), the deadlock no longer exists: compliant
self host-reports alone are sufficient to recover when the active
set is empty.

Invert both tests to the new contract:

1. TestAuditEmptyActiveSetBootstrap_HostOnlyReportsRecover
   (was: TestAuditEmptyActiveSetBootstrap_LegacyMetricsBreaksDeadlock
    + TestAuditEmptyActiveSetDeadlock_HostOnlyReportsCannotRecover)

   Asserts both POSTPONED SNs recover to ACTIVE at epoch 1 end after
   submitting compliant host-only reports — no legacy metrics path
   needed. The chain self-heals.

2. TestAuditEmptyActiveSetBootstrap_NonCompliantHostStaysPostponed (NEW)

   Guards the self-compliance gate. With MinDiskFreePercent=20, a
   POSTPONED SN that reports DiskUsagePercent=95 (5% free) MUST
   remain POSTPONED even though the active set is empty. This blocks
   the exception from becoming a 'free pass' for misbehaving SNs and
   complements the unit-level violation tests in
   x/audit/v1/keeper/enforcement_empty_active_set_test.go.
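
The arithmetic behind the second test can be sketched as follows. The function name, the float64 types, and the inclusive boundary are assumptions; the real compliance check may differ in all three.

```go
package main

import "fmt"

// diskFreeCompliant is a hypothetical helper: free space is derived from the
// reported usage and compared against the configured minimum.
func diskFreeCompliant(diskUsagePercent, minDiskFreePercent float64) bool {
	free := 100 - diskUsagePercent
	return free >= minDiskFreePercent // inclusive boundary is an assumption
}

func main() {
	// Values from the test: 95% used leaves 5% free, below the 20% minimum.
	fmt.Println(diskFreeCompliant(95, 20)) // prints false
	// A host at 60% used has 40% free and passes.
	fmt.Println(diskFreeCompliant(60, 20)) // prints true
}
```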

Helpers added in audit_test_helpers_test.go:
- auditHostReportWithDiskUsageJSON: lets a test pin DiskUsagePercent.
- setAuditParamsForFastEpochsWithMinDiskFree: lets a test override
  MinDiskFreePercent in genesis.

Found by: PR #135 CI system-test failure (the previous tests asserted
the pre-fix deadlock contract). The original assertions are now
covered by historical context in commit messages and the skill
lumera-supernode-postponed-recovery.

(cherry picked from commit d0e181d)
@mateeullahmalik merged commit f4681d9 into 1.12.0 on May 11, 2026
7 checks passed