[1.12.0] cherry-pick #135: fix(x/audit) bootstrap-recovery exception when active set is empty by mateeullahmalik · Pull Request #137 · LumeraProtocol/lumera

mateeullahmalik · 2026-05-11T19:38:29Z

Cherry-picks PR #135 (fix(x/audit): bootstrap-recovery exception when active set is empty) onto the 1.12.0 release branch.

What this is

Clean git cherry-pick -x of squash-merge commit d0e181d from master. No conflicts. No code modifications during the cherry-pick — same fix, same tests.

Why this needs to be in v1.12.0

The empty-active-set deadlock is a real protocol design gap that triggers when all supernodes are simultaneously POSTPONED at an epoch boundary. Once in this state, the chain cannot self-heal via gov or any chain-level mechanism — only an out-of-band coordinated deregister + re-register cycle by every postponed validator key holder can recover it.
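The structural nature of the deadlock can be illustrated with a minimal sketch. It assumes the legacy rule is existential: a POSTPONED supernode recovers only if at least one peer report attests every required port OPEN. The `PeerReport` type and `legacyPeerPortRecover` function are hypothetical simplifications for illustration, not the keeper's actual code.

```go
package main

import "fmt"

// PeerReport is a hypothetical, simplified peer observation: one ACTIVE
// prober's view of which ports on a target supernode were OPEN.
type PeerReport struct {
	Prober    string
	OpenPorts map[uint32]bool
}

// legacyPeerPortRecover models the assumed legacy rule: recovery needs at
// least one peer report in which every required port is attested OPEN.
func legacyPeerPortRecover(reports []PeerReport, required []uint32) bool {
	for _, r := range reports {
		allOpen := true
		for _, p := range required {
			if !r.OpenPorts[p] {
				allOpen = false
				break
			}
		}
		if allOpen {
			return true
		}
	}
	// With zero probers there are zero reports, so control always ends here.
	return false
}

func main() {
	required := []uint32{4444, 4445, 8002}

	// Empty active set: nobody probes, so no peer report can exist.
	fmt.Println(legacyPeerPortRecover(nil, required)) // false

	// A single ACTIVE prober attesting all ports OPEN is enough to recover.
	one := []PeerReport{{
		Prober:    "sn-a",
		OpenPorts: map[uint32]bool{4444: true, 4445: true, 8002: true},
	}}
	fmt.Println(legacyPeerPortRecover(one, required)) // true
}
```

Under this model the healthy POSTPONED node and the broken one are indistinguishable once the active set is empty, which is why no amount of waiting resolves the state.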

Why v1.12.0 specifically is the right place:

Live mainnet params (queried at height 5,001,129):

consecutive_epochs_to_postpone: 1        ← single missed epoch postpones
required_open_ports:            [4444, 4445, 8002]
epoch_length_blocks:            400      (~40 min)

Current mainnet supernode breakdown: 15 ACTIVE / 11 POSTPONED / 2 DISABLED. Thin active margin.

Upgrade-day trigger: the chain halts at the upgrade height, and a subset of SN operators roll forward late (by more than one epoch, ~40 min). Those SNs miss one epoch report → POSTPONED in lockstep. If the surviving ACTIVE set drops to zero (plausible with only 15 of 28 supernodes currently ACTIVE), the deadlock persists until every postponed validator key holder completes the manual deregister + re-register cycle.

v1.12.0 is the next upgrade after the long-running v1.11.x line. This fix MUST land in the binary that nodes upgrade INTO so the safety net is in place from block 1 of v1.12.0.

What changed

11-line fix in x/audit/v1/keeper/enforcement.go::shouldRecoverAtEpochEnd: when the epoch's anchored active set is empty, accept a compliant self host-report alone as sufficient for recovery. Sits AFTER the storage-truth and action-finalization redirects, AFTER selfHostCompliant — a misbehaving SN cannot self-recover via this branch.
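The ordering described above can be sketched roughly as below. This is an illustration of the decision order only, not the code in enforcement.go: the `recoveryInputs` struct, its field names, and the `shouldRecover` function are hypothetical, and the redirect paths are reduced to a single boolean outcome. `EpochAnchor.ActiveSupernodeAccounts` and the self-compliance gate are named in the upstream commit message.

```go
package main

import "fmt"

// EpochAnchor mirrors only the one field the fix reads.
type EpochAnchor struct {
	ActiveSupernodeAccounts []string
}

// recoveryInputs is a hypothetical bundle of facts already computed for one
// POSTPONED supernode at epoch end.
type recoveryInputs struct {
	Redirected        bool         // storage-truth or action-finalization redirect applies
	RedirectRecovers  bool         // outcome of that redirect's own recovery semantics
	SelfHostCompliant bool         // a compliant self host-report exists
	Anchor            *EpochAnchor // nil when no anchor exists for the epoch
	PeerAllPortsOpen  bool         // outcome of the legacy peer-port check
}

// shouldRecover sketches the decision order described in this PR.
func shouldRecover(in recoveryInputs) bool {
	// 1. Redirects keep their own recovery semantics.
	if in.Redirected {
		return in.RedirectRecovers
	}
	// 2. Self-compliance gate: a misbehaving SN stops here.
	if !in.SelfHostCompliant {
		return false
	}
	// 3. Bootstrap exception: an anchor exists and its active set is empty.
	if in.Anchor != nil && len(in.Anchor.ActiveSupernodeAccounts) == 0 {
		return true
	}
	// 4. Otherwise the legacy peer-port path decides, unchanged.
	return in.PeerAllPortsOpen
}

func main() {
	empty := &EpochAnchor{}
	nonEmpty := &EpochAnchor{ActiveSupernodeAccounts: []string{"sn-a"}}

	fmt.Println(shouldRecover(recoveryInputs{SelfHostCompliant: true, Anchor: empty}))    // true
	fmt.Println(shouldRecover(recoveryInputs{SelfHostCompliant: false, Anchor: empty}))   // false
	fmt.Println(shouldRecover(recoveryInputs{SelfHostCompliant: true, Anchor: nonEmpty})) // false
	fmt.Println(shouldRecover(recoveryInputs{SelfHostCompliant: true, Anchor: nil}))      // false
}
```

Placing the exception after the self-compliance gate is the design choice that keeps it from becoming a free pass: the branch can only add a recovery for a node that already reports itself healthy.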

Tests cover the full matrix (unit + systemtest) — see PR #135 for the table.

Verification on 1.12.0 branch

go build ./x/audit/... clean
go test ./x/audit/v1/keeper/... -run "EmptyActiveSet|NoEpochAnchor|NonEmptyActiveSet" — PASS

Risk

LOW. Same risk profile as PR #135 (already reviewed and merged to master). No state-key changes, no proto changes, no consensus version bump. The branch is only reachable when no other recovery path applies, self-compliance has already been verified, and the active set is empty — a pure safety net. In any normal operating state (≥1 ACTIVE), the branch is skipped and the legacy peer-port path runs unchanged. Cosmos determinism preserved.

Rollback

git revert the cherry-pick commit. The legacy peer-port deadlock returns; manual recovery via documented runbook.

Refs

Upstream PR: fix(x/audit): bootstrap-recovery exception when active set is empty #135
Master commit: d0e181d9b159d19efa22901fa606e078352f947
Internal runbook: lumera-supernode-postponed-recovery

…135)

* fix(x/audit): bootstrap-recovery exception when active set is empty

When the epoch's anchored active set is empty (all supernodes POSTPONED), the peer-port recovery rule in shouldRecoverAtEpochEnd becomes unsatisfiable by construction: with zero probers, no peer report exists that could attest all-ports-OPEN for any POSTPONED supernode. The chain cannot self-heal; every validator key holder must perform a manual deregister+re-register cycle out-of-band.

Trigger on mainnet (live params at height 5,001,129):
- consecutive_epochs_to_postpone = 1 (one missed epoch postpones)
- 15 ACTIVE / 11 POSTPONED / 2 DISABLED (thin active margin)
- Upgrade halts >= 1 epoch (~40 min) → SNs that lag postpone in lockstep → active set can drop to 0 → permanent deadlock.

Fix: when GetEpochAnchor(epochID).ActiveSupernodeAccounts is empty, accept a compliant self host-report alone as sufficient for recovery. The bootstrap exception sits AFTER the storage-truth and action-finalization redirects (they keep their own recovery semantics) and AFTER selfHostCompliant (a misbehaving SN cannot self-recover via this branch). When no anchor exists for the epoch (test fixture or pre-anchor edge case), the branch is skipped and the legacy peer-port path runs unchanged.

Test matrix (5 cells):
- empty anchor + compliant self-report → recover
- empty anchor + no self-report → no-recover (self-gate)
- empty anchor + non-compliant self-report → no-recover (self-gate)
- non-empty anchor + no peer obs → no-recover (legacy preserved)
- no anchor → no-recover (legacy preserved)

The pre-fix scenario test that asserted deadlock (TestEnforceEpochEnd_EmptyActiveSet_PostponedCannotRecover) is inverted to its new contract: recovery succeeds via the bootstrap exception when self-reports are compliant.

Risk: LOW. Reads existing deterministic state (EpochAnchor) only. Branch is only reachable when no other recovery path applies and self-compliance has already been verified. No new external calls, no wall-clock dependency, no map iteration. Cosmos determinism preserved.

Rollback: revert this commit. The legacy peer-port deadlock returns, recoverable via the documented deregister+re-register procedure (skill: lumera-supernode-postponed-recovery).

Refs: 2026-05-08 devnet incident where all 5 SNs went POSTPONED; gov proposal 33 to bypass via empty required_open_ports passed on-chain but silently no-op'd because Params.WithDefaults() re-fills the list. The deadlock is a real protocol-level design gap, not a devnet quirk.

* test(systemtests): invert empty-active-set tests for bootstrap exception

Two system tests in audit_empty_active_set_bootstrap_test.go were written to document the empty-active-set DEADLOCK as expected behavior (one used legacy MsgReportSupernodeMetrics to break it, the other asserted 3 consecutive host-only-report epochs never recover). With the bootstrap-recovery exception in shouldRecoverAtEpochEnd (this PR's main change), the deadlock no longer exists: compliant self host-reports alone are sufficient to recover when the active set is empty.

Invert both tests to the new contract:

1. TestAuditEmptyActiveSetBootstrap_HostOnlyReportsRecover (was: TestAuditEmptyActiveSetBootstrap_LegacyMetricsBreaksDeadlock + TestAuditEmptyActiveSetDeadlock_HostOnlyReportsCannotRecover)
   Asserts both POSTPONED SNs recover to ACTIVE at epoch 1 end after submitting compliant host-only reports; no legacy metrics path needed. The chain self-heals.

2. TestAuditEmptyActiveSetBootstrap_NonCompliantHostStaysPostponed (NEW)
   Guards the self-compliance gate. With MinDiskFreePercent=20, a POSTPONED SN that reports DiskUsagePercent=95 (5% free) MUST remain POSTPONED even though the active set is empty. This blocks the exception from becoming a 'free pass' for misbehaving SNs and complements the unit-level violation tests in x/audit/v1/keeper/enforcement_empty_active_set_test.go.

Helpers added in audit_test_helpers_test.go:
- auditHostReportWithDiskUsageJSON: lets a test pin DiskUsagePercent.
- setAuditParamsForFastEpochsWithMinDiskFree: lets a test override MinDiskFreePercent in genesis.

Found by: PR #135 CI system-test failure (the previous tests asserted the pre-fix deadlock contract). The original assertions are now covered by historical context in commit messages and the skill lumera-supernode-postponed-recovery.

(cherry picked from commit d0e181d)
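The commit message notes that a governance proposal setting required_open_ports to an empty list was silently undone because Params.WithDefaults() re-fills the list. A minimal sketch of that failure mode follows. The `Params` struct, the default port values, and the body of `WithDefaults` here are assumptions made for illustration; only the method name and the re-fill behavior come from the source.

```go
package main

import "fmt"

// Params is a hypothetical stand-in carrying only the field under discussion.
type Params struct {
	RequiredOpenPorts []uint32
}

// defaultRequiredOpenPorts is assumed for illustration; the real defaults
// are not stated in this PR.
var defaultRequiredOpenPorts = []uint32{4444, 4445, 8002}

// WithDefaults sketches the assumed behavior: an empty list is treated as
// "unset" and replaced with a copy of the defaults.
func (p Params) WithDefaults() Params {
	if len(p.RequiredOpenPorts) == 0 {
		p.RequiredOpenPorts = append([]uint32(nil), defaultRequiredOpenPorts...)
	}
	return p
}

func main() {
	// A proposal that deliberately sets an explicit empty list...
	proposed := Params{RequiredOpenPorts: []uint32{}}

	// ...is indistinguishable from "unset" once defaults are applied.
	effective := proposed.WithDefaults()
	fmt.Println(len(effective.RequiredOpenPorts)) // 3
}
```

Because "empty" and "unset" collapse into the same value, a parameter change cannot express "require no ports", which is why the fix lives in the recovery logic rather than in params.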

mateeullahmalik merged commit f4681d9 into 1.12.0 May 11, 2026
7 checks passed

mateeullahmalik merged 1 commit into 1.12.0 from cherry-pick/audit-bootstrap-recovery-to-1.12.0
