Breaking up node_state.h
#7750
eddyashton
started this conversation in
Design
Replies: 0 comments
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Uh oh!
There was an error while loading. Please reload this page.
-
It's huge, fragile, increasingly unmaintainable. Here's a sketch of how we fix it.
(Meta-commentary - the file names aren't quite right, but we can resolve that separately. "std::variant" vs "collection of std::optional"s isn't something we need to pin down now. Both suck in their own way, we'd need to see which was actually sharper to hold).
NodeState Decomposition Plan: Type-State Builder
Motivation
NodeState(~3200 lines) is a god object that manages the entire lifecycle of aCCF node through a state machine (
uninitialized→initialized→pending/readingPublicLedger→ … →partOfNetwork). It has ~30 member fields, ~15setup methods, and three distinct startup paths (
Start,Join,Recover).The core problem is that fields are nullable/optional members of a single class,
and the code relies on implicit ordering invariants to ensure they're initialized
before use. These invariants are invisible to the compiler and routinely broken
by refactoring:
consensusduringrecovery (hooks fire during ledger replay, before consensus is constructed).
snapshot (hooks weren't registered when the snapshot was deserialized).
this->snapshotterby shared_ptr copy, so theymust be registered after
setup_snapshotter()— an ordering constraintexpressed nowhere in the type system.
The type-state pattern makes these constraints compile-time errors instead of
runtime crashes.
Core Idea
Replace the single
NodeStateclass with a chain of structs, where each structrepresents a lifecycle phase and holds exactly the fields available in that
phase. Transitioning to the next phase consumes the current struct and produces
the next, so it's impossible to access fields that don't exist yet.
Proposed Types
NodeCore— always available, constructed firstFields constructed in the
NodeStateconstructor and earlycreate():All hooks that don't depend on consensus are registered here. The
ConfigurationChangeHookwrites to aConfigurationTracker(a lightweightstd::list<Configuration>) that is also held here.NodeInitialized— post-attestation, pre-consensusNodeRecovering— reading public/private ledgerNo
consensusfield exists on this type — it is impossible to accidentallydereference it.
NodePending— join node waiting for network acceptanceNodeActive— fully operationalThe
consensusfield is non-optional. It is constructed during the transitioninto
NodeActive, consuming theConfigurationTracker's accumulatedconfigurations.
File Layout
NodeStateitself becomes a thin wrapper holding astd::variant<NodeInitialized, NodePending, NodeRecovering, NodeActive>,implementing the
AbstractNodeOperationandAbstractNodeStateinterfaces bydispatching to the active variant. Methods that are only valid in certain phases
(e.g.,
recover_public_ledger_entries) are only defined on the correspondingtype, so calling them in the wrong phase is a compile error inside the
implementation (variant access throws/asserts), rather than a silent
use-after-null.
Incremental Migration Path
This can be built incrementally. Each step is independently mergeable and
testable:
Step 0:
ConfigurationTracker(this PR)Extract configuration tracking into a standalone object that implements the two
ConfigurableConsensusmethods the hook needs (get_latest_configuration_unsafeand
add_configuration). Hooks write to it pre-consensus; its state is consumedby the
Raftconstructor. ~50 lines, zero behavioral change. Validates thepattern.
Step 1: Extract
NodeCoreMove the "always available" fields into a
NodeCorestruct.NodeStateholds aNodeCoremember. Allsetup_*methods becomeNodeCoremethods or freefunctions taking
NodeCore&. Hook registrations move toNodeCore::setup().This is a mechanical refactor — the public interface doesn't change.
Validation: All existing tests pass unchanged.
Step 2: Extract
NodeRecoveringMove recovery-specific fields (
last_recovered_idx,view_history,recovery_store, etc.) into aNodeRecoveringstruct.NodeStateholds it asstd::optional<NodeRecovering>. Recovery methods(
recover_public_ledger_entries,recover_private_ledger_entries) move toNodeRecovering.NodeStatedispatches to it.Validation: Recovery e2e tests (including SNP platform tests).
Step 3: Extract
NodePendingMove join-specific fields (
join_periodic_task,snapshot_fetch_task) intoNodePending. The join callback transitions fromNodePendingtoNodeActive.Validation: Join/rotation e2e tests.
Step 4: Introduce
NodeActiveand the variantReplace the state machine enum with a
std::variant. TheAbstractNodeOperation/
AbstractNodeStatemethods dispatch on the variant. Phase-specific methodsare only callable on the correct variant member.
Validation: Full CI.
Step 5: Remove
NodeStatestate machineThe
smfield andNodeStartupStateenum become redundant — the active variantmember is the state. Remove them.
Key Design Decisions
Why not a single abstract base class with virtual methods? The phases share
most fields (via
NodeCore), but the point is to make absence of fieldsvisible. A base class with virtual methods still allows null-deref on optional
members. The variant forces you to handle each case.
Why variant over inheritance? The phases are not substitutable — you can't
treat a recovering node as an active node. Variant makes this explicit.
Inheritance would re-introduce the "which fields are valid?" problem.
Thread safety:
NodeStatecurrently uses a singlelockmutex. Thisdoesn't change — the wrapper still holds the lock before dispatching to the
variant. The variant itself is only accessed under the lock.
Interface compatibility:
AbstractNodeOperationandAbstractNodeStatedon't change.
NodeStateimplements them by dispatching to the variant. Externalcode (frontends, host) sees no change.
What This Solves
NodeRecoveringhas noconsensusfieldConfigurationTrackeralways exists inNodeCoreNodeCore::setup(), once, earlyNodeCoreconstructs snapshotter before hooksBeta Was this translation helpful? Give feedback.
All reactions