Breaking up `node_state.h` #7750

eddyashton · 2026-03-18T14:23:32Z

eddyashton
Mar 18, 2026
Maintainer

It's huge, fragile, increasingly unmaintainable. Here's a sketch of how we fix it.

(Meta-commentary: the file names aren't quite right, but we can resolve that separately. "std::variant" vs "collection of std::optional"s isn't something we need to pin down now. Both suck in their own way; we'd need to see which is actually sharper to hold.)

NodeState Decomposition Plan: Type-State Builder

Motivation

NodeState (~3200 lines) is a god object that manages the entire lifecycle of a
CCF node through a state machine (uninitialized → initialized → pending /
readingPublicLedger → … → partOfNetwork). It has ~30 member fields, ~15
setup methods, and three distinct startup paths (Start, Join, Recover).

The core problem is that fields are nullable/optional members of a single class,
and the code relies on implicit ordering invariants to ensure they're initialized
before use. These invariants are invisible to the compiler and routinely broken
by refactoring:

- Moving a hook registration earlier caused a null-deref on consensus during
  recovery (hooks fire during ledger replay, before consensus is constructed).
- Moving it later caused joining nodes to miss configurations from their startup
  snapshot (hooks weren't registered when the snapshot was deserialized).
- The snapshotter hooks capture this->snapshotter by shared_ptr copy, so they
  must be registered after setup_snapshotter(). This ordering constraint is
  expressed nowhere in the type system.

The type-state pattern makes these constraints compile-time errors instead of
runtime crashes.

Core Idea

Replace the single NodeState class with a chain of structs, where each struct
represents a lifecycle phase and holds exactly the fields available in that
phase. Transitioning to the next phase consumes the current struct and produces
the next, so it's impossible to access fields that don't exist yet.

NodeCore → NodeInitialized → NodePending (Join)
                            → NodeRecovering (Recover)
                            → NodeActive (Start, or after Join/Recover complete)
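To make "consumes the current struct and produces the next" concrete, here is a minimal sketch of one transition. `Core`, `Consensus`, `Initialized` and `Active` are hypothetical, stripped-down stand-ins rather than the real CCF types; the only technique being shown is an rvalue-qualified member function as the transition.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-ins for the real CCF types.
struct Core
{
  std::string node_id;
};

struct Consensus
{
  std::string owner;
};

// Final phase: consensus exists for the whole lifetime of the struct.
struct Active
{
  Core core;
  std::unique_ptr<Consensus> consensus;
};

// Earlier phase: there is no consensus member to dereference.
struct Initialized
{
  Core core;

  // Rvalue-qualified: callable only on an expiring object, so the caller
  // has to write std::move(init).start() and visibly give up `init`.
  Active start() &&
  {
    auto consensus = std::make_unique<Consensus>(Consensus{core.node_id});
    return Active{std::move(core), std::move(consensus)};
  }
};
```

Calling `init.start()` on an lvalue does not compile; `std::move(init).start()` does. C++ will not stop later reads of the moved-from `init` the way Rust would, so this is a convention backed by tooling (for example clang-tidy's `bugprone-use-after-move` check) rather than a hard guarantee.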

Proposed Types

`NodeCore` — always available, constructed first

Fields constructed in the NodeState constructor and early create():

struct NodeCore {
    // Identity (constructor)
    ccf::crypto::CurveID curve_id;
    std::shared_ptr<ccf::crypto::ECKeyPair_OpenSSL> node_sign_kp;
    NodeId self;
    std::shared_ptr<ccf::crypto::RSAKeyPair> node_encrypt_kp;
    ccf::crypto::Pem self_signed_node_cert;

    // Infrastructure (initialize/create)
    NetworkState& network;
    std::shared_ptr<RPCMap> rpc_map;
    std::shared_ptr<RPCSessions> rpcsessions;
    std::shared_ptr<NodeToNode> n2n_channels;
    ringbuffer::AbstractWriterFactory& writer_factory;
    ccf::StartupConfig config;

    // KV infrastructure (setup_* in create, before any startup path diverges)
    std::shared_ptr<MerkleTxHistory> history;
    std::shared_ptr<Snapshotter> snapshotter;
    std::shared_ptr<ccf::kv::AbstractTxEncryptor> encryptor;

    // Attestation
    QuoteInfo quote_info;
    pal::PlatformAttestationMeasurement node_measurement;
};

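All hooks that don't depend on consensus are registered here. The
ConfigurationChangeHook writes to a ConfigurationTracker (a lightweight
std::list<Configuration>) that exists from this point on. In the struct
sketches below it appears as a config_tracker member alongside core in every
pre-consensus phase.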

`NodeInitialized` — post-attestation, pre-consensus

struct NodeInitialized {
    NodeCore core;
    ConfigurationTracker config_tracker;

    // Attestation results (from launch_node)
    std::optional<pal::snp::TcbVersionRaw> snp_tcb_version;
    std::optional<pal::UVMEndorsements> snp_uvm_endorsements;

    // Snapshot loaded during find_local_startup_snapshot
    std::unique_ptr<StartupSnapshotInfo> startup_snapshot_info;

    // Methods
    NodeActive start(...);           // StartType::Start
    NodePending join(...);           // StartType::Join
    NodeRecovering recover(...);     // StartType::Recover
};

`NodeRecovering` — reading public/private ledger

struct NodeRecovering {
    NodeCore core;
    ConfigurationTracker config_tracker;

    // Recovery-specific state
    std::vector<ccf::kv::Version> view_history;
    ::consensus::Index last_recovered_idx;
    ::consensus::Index last_recovered_signed_idx;
    RecoveredEncryptedLedgerSecrets recovered_encrypted_ledger_secrets;

    // Transitions
    NodeActive complete_public_recovery(...);  // creates consensus, returns NodeActive
};

No consensus field exists on this type — it is impossible to accidentally
dereference it.
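If we want that property pinned down so a later refactor can't quietly add the field back, it could be asserted at compile time. The sketch below uses the C++17 detection idiom on two reduced, hypothetical versions of the structs; `has_consensus` is a name made up for illustration.

```cpp
#include <memory>
#include <type_traits>
#include <utility>

struct Consensus
{};

// Reduced, hypothetical versions of the proposed structs.
struct NodeRecovering
{
  int last_recovered_idx = 0;
};

struct NodeActive
{
  std::shared_ptr<Consensus> consensus;
};

// C++17 detection idiom: the partial specialisation is chosen only when
// the expression `t.consensus` is well-formed for an object t of type T.
template <typename T, typename = void>
struct has_consensus : std::false_type
{};

template <typename T>
struct has_consensus<T, std::void_t<decltype(std::declval<T&>().consensus)>>
  : std::true_type
{};

static_assert(
  !has_consensus<NodeRecovering>::value,
  "recovery code must not be able to reach consensus");
static_assert(
  has_consensus<NodeActive>::value, "active nodes always carry consensus");
```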

`NodePending` — join node waiting for network acceptance

struct NodePending {
    NodeCore core;
    ConfigurationTracker config_tracker;
    std::shared_ptr<ccf::tasks::Task> join_periodic_task;

    // Transition
    NodeActive accepted(JoinNetworkNodeToNode::Out resp);
};

`NodeActive` — fully operational

struct NodeActive {
    NodeCore core;
    std::shared_ptr<ccf::kv::Consensus> consensus;  // non-null, always
    std::shared_ptr<Forwarder<NodeToNode>> cmd_forwarder;
    std::shared_ptr<JwtKeyAutoRefresh> jwt_key_auto_refresh;

    // Recovery-only (optional, for private ledger phase)
    std::optional<RecoveryState> recovery;
};

The consensus field is non-optional. It is constructed during the transition
into NodeActive, consuming the ConfigurationTracker's accumulated
configurations.

File Layout

src/node/
  node_types.h              — NodeCore, NodeInitialized, etc. (struct definitions)
  node_core.cpp             — NodeCore setup (history, snapshotter, hooks, encryptor)
  node_recovery.cpp         — NodeRecovering methods (ledger replay)
  node_join.cpp             — NodePending methods (join timer, snapshot fetch)
  node_active.cpp           — NodeActive methods (tick, recv, governance)
  node_state.h              — NodeState wrapper (holds variant, implements interfaces)
  configuration_tracker.h   — Lightweight config accumulator

NodeState itself becomes a thin wrapper holding a
std::variant<NodeInitialized, NodePending, NodeRecovering, NodeActive>,
implementing the AbstractNodeOperation and AbstractNodeState interfaces by
dispatching to the active variant. Methods that are only valid in certain phases
(e.g., recover_public_ledger_entries) are only defined on the corresponding
type, so calling them on the wrong phase type is a compile error inside the
implementation. When the wrapper reaches for the wrong alternative, the variant
access throws or asserts, which is a loud failure rather than a silent null
dereference.
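A rough sketch of the wrapper's two kinds of dispatch, assuming heavily reduced phase types and a made-up `is_part_of_network()` query (neither the members nor the error type are taken from the real code):

```cpp
#include <cstdint>
#include <stdexcept>
#include <type_traits>
#include <utility>
#include <variant>

// Hypothetical, heavily reduced phase types.
struct NodeInitialized
{};
struct NodePending
{};
struct NodeActive
{};

struct NodeRecovering
{
  std::uint64_t last_recovered_idx = 0;

  void recover_public_ledger_entries(std::uint64_t count)
  {
    last_recovered_idx += count;
  }
};

class NodeState
{
public:
  using Phase =
    std::variant<NodeInitialized, NodePending, NodeRecovering, NodeActive>;

  explicit NodeState(Phase initial) : phase(std::move(initial)) {}

  // Valid in every phase: the visitor must compile for each alternative.
  bool is_part_of_network() const
  {
    return std::visit(
      [](const auto& p) {
        return std::is_same_v<std::decay_t<decltype(p)>, NodeActive>;
      },
      phase);
  }

  // Valid only while recovering: any other phase fails loudly at runtime.
  void recover_public_ledger_entries(std::uint64_t count)
  {
    auto* recovering = std::get_if<NodeRecovering>(&phase);
    if (recovering == nullptr)
    {
      throw std::logic_error("recover_public_ledger_entries: not recovering");
    }
    recovering->recover_public_ledger_entries(count);
  }

  // std::get throws std::bad_variant_access outside the recovering phase.
  std::uint64_t last_recovered_idx() const
  {
    return std::get<NodeRecovering>(phase).last_recovered_idx;
  }

private:
  Phase phase;
};
```

The compile-time part lives inside the phase types: `NodeActive` has no `recover_public_ledger_entries`, so code holding a `NodeActive&` cannot call it. The wrapper only adds the runtime check at the interface boundary.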

Incremental Migration Path

This can be built incrementally. Each step is independently mergeable and
testable:

Step 0: `ConfigurationTracker` (this PR)

Extract configuration tracking into a standalone object that implements the two
ConfigurableConsensus methods the hook needs (get_latest_configuration_unsafe
and add_configuration). Hooks write to it pre-consensus; its state is consumed
by the Raft constructor. ~50 lines, zero behavioral change. Validates the
pattern.
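For a feel of the size, a minimal sketch of such a tracker follows. The two method names come from the plan above, but their signatures, the `Configuration` record, the `release()` hand-off and the `Consensus` stand-in are simplified assumptions, not the real CCF or Raft interfaces.

```cpp
#include <cstdint>
#include <list>
#include <set>
#include <string>
#include <utility>

// Hypothetical, simplified configuration record.
struct Configuration
{
  std::uint64_t idx;
  std::set<std::string> nodes;
};

class ConfigurationTracker
{
  std::list<Configuration> configurations;

public:
  // Called by the configuration hook, including during ledger replay and
  // snapshot deserialisation, when no consensus object exists yet.
  void add_configuration(std::uint64_t idx, std::set<std::string> nodes)
  {
    configurations.push_back({idx, std::move(nodes)});
  }

  // Returns the most recent node set, or an empty set if none was recorded.
  std::set<std::string> get_latest_configuration_unsafe() const
  {
    return configurations.empty() ? std::set<std::string>{} :
                                    configurations.back().nodes;
  }

  // Hands over the accumulated history; rvalue-qualified so the tracker
  // is visibly given up at the call site.
  std::list<Configuration> release() &&
  {
    return std::move(configurations);
  }
};

// Stand-in for the consensus constructor consuming the tracker's state.
struct Consensus
{
  std::list<Configuration> configurations;

  explicit Consensus(ConfigurationTracker&& tracker) :
    configurations(std::move(tracker).release())
  {}
};
```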

Step 1: Extract `NodeCore`

Move the "always available" fields into a NodeCore struct. NodeState holds a
NodeCore member. All setup_* methods become NodeCore methods or free
functions taking NodeCore&. Hook registrations move to NodeCore::setup().
This is a mechanical refactor — the public interface doesn't change.

Validation: All existing tests pass unchanged.
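The ordering that `NodeCore::setup()` is meant to own could look roughly like this. `Snapshotter`, the `hooks` vector and the `commits_seen` counter are placeholders invented for the sketch; the point is only that construction and hook registration sit in one function, in one fixed order.

```cpp
#include <functional>
#include <memory>
#include <vector>

// Placeholder for the real snapshotter.
struct Snapshotter
{
  int commits_seen = 0;
};

struct NodeCore
{
  std::shared_ptr<Snapshotter> snapshotter;
  std::vector<std::function<void()>> hooks;

  void setup()
  {
    // 1. Construct everything the hooks will capture.
    snapshotter = std::make_shared<Snapshotter>();

    // 2. Register hooks. The capture copies a non-null shared_ptr because
    //    step 1 has already run in this same function.
    hooks.emplace_back([s = snapshotter]() { ++s->commits_seen; });
  }
};
```

With both steps in a single function there is no separate `setup_snapshotter()` call that a refactor can reorder relative to hook registration.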

Step 2: Extract `NodeRecovering`

Move recovery-specific fields (last_recovered_idx, view_history,
recovery_store, etc.) into a NodeRecovering struct. NodeState holds it as
std::optional<NodeRecovering>. Recovery methods
(recover_public_ledger_entries, recover_private_ledger_entries) move to
NodeRecovering. NodeState dispatches to it.

Validation: Recovery e2e tests (including SNP platform tests).
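As an illustration of this intermediate state, the sketch below has `NodeState` own an optional recovery struct and forward to it. `begin_recovery()`, `end_recovery()` and `is_recovering()` are hypothetical names, and the struct is reduced to a single field.

```cpp
#include <cstdint>
#include <optional>

// Reduced, hypothetical recovery state.
struct NodeRecovering
{
  std::uint64_t last_recovered_idx = 0;

  void recover_public_ledger_entries(std::uint64_t count)
  {
    last_recovered_idx += count;
  }
};

class NodeState
{
  std::optional<NodeRecovering> recovering;

public:
  void begin_recovery()
  {
    recovering.emplace();
  }

  // Leaves the recovery phase and reports how far replay got.
  std::uint64_t end_recovery()
  {
    const auto idx = recovering.value().last_recovered_idx;
    recovering.reset();
    return idx;
  }

  bool is_recovering() const
  {
    return recovering.has_value();
  }

  // value() throws std::bad_optional_access outside recovery, instead of
  // touching fields that have no meaning in the current phase.
  void recover_public_ledger_entries(std::uint64_t count)
  {
    recovering.value().recover_public_ledger_entries(count);
  }
};
```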

Step 3: Extract `NodePending`

Move join-specific fields (join_periodic_task, snapshot_fetch_task) into
NodePending. The join callback transitions from NodePending to NodeActive.

Validation: Join/rotation e2e tests.

Step 4: Introduce `NodeActive` and the variant

Replace the state machine enum with a std::variant. The AbstractNodeOperation
/ AbstractNodeState methods dispatch on the variant. Phase-specific methods
are only callable on the correct variant member.

Validation: Full CI.
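One possible shape for a transition once the variant exists, replacing "check the enum, then flip the enum" with a single typed step. The variant is cut down to two alternatives and `on_join_accepted` is a hypothetical free function, not an existing CCF entry point.

```cpp
#include <stdexcept>
#include <utility>
#include <variant>

struct NodeCore
{
  int id = 0;
};

struct NodeActive
{
  NodeCore core;
};

struct NodePending
{
  NodeCore core;

  // Consumes the pending phase and produces the active one.
  NodeActive accepted() &&
  {
    return NodeActive{std::move(core)};
  }
};

using Phase = std::variant<NodePending, NodeActive>;

void on_join_accepted(Phase& phase)
{
  auto* pending = std::get_if<NodePending>(&phase);
  if (pending == nullptr)
  {
    throw std::logic_error("join response received outside the pending phase");
  }

  // The right-hand side is fully evaluated into a NodeActive temporary
  // before the variant destroys its NodePending and takes the new value.
  phase = std::move(*pending).accepted();
}
```

After the assignment, `std::holds_alternative<NodeActive>(phase)` answers the question the enum used to answer.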

Step 5: Remove `NodeState` state machine

The sm field and NodeStartupState enum become redundant — the active variant
member is the state. Remove them.

Key Design Decisions

Why not a single abstract base class with virtual methods? The phases share
most fields (via NodeCore), but the point is to make absence of fields
visible. A base class with virtual methods still allows null-deref on optional
members. The variant forces you to handle each case.

Why variant over inheritance? The phases are not substitutable — you can't
treat a recovering node as an active node. Variant makes this explicit.
Inheritance would re-introduce the "which fields are valid?" problem.

Thread safety: NodeState currently guards its state with a single mutex (lock).
This doesn't change: the wrapper still takes the lock before dispatching to the
variant, so the variant itself is only ever accessed under the lock.
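A small sketch of how the wrapper might keep that rule hard to break: route every variant access through one private helper that takes the lock first. `with_phase` and the `ticks` counters are invented for the example.

```cpp
#include <mutex>
#include <utility>
#include <variant>

struct NodePending
{
  int ticks = 0;
};

struct NodeActive
{
  int ticks = 0;
};

class NodeState
{
  std::mutex lock;
  std::variant<NodePending, NodeActive> phase;

  // Single choke point: the variant is only ever visited under the lock.
  template <typename F>
  decltype(auto) with_phase(F&& f)
  {
    std::lock_guard<std::mutex> guard(lock);
    return std::visit(std::forward<F>(f), phase);
  }

public:
  void tick()
  {
    with_phase([](auto& p) { ++p.ticks; });
  }

  int ticks()
  {
    return with_phase([](auto& p) { return p.ticks; });
  }
};
```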

Interface compatibility: AbstractNodeOperation and AbstractNodeState
don't change. NodeState implements them by dispatching to the variant. External
code (frontends, host) sees no change.

What This Solves

| Problem | How it's solved |
| --- | --- |
| Null consensus deref during recovery | `NodeRecovering` has no `consensus` field |
| Missing config from snapshot | `ConfigurationTracker` always exists in `NodeCore` |
| Hook registration order bugs | Hooks registered in `NodeCore::setup()`, once, early |
| Snapshotter hooks before snapshotter | `NodeCore` constructs snapshotter before hooks |
| "Which fields exist in this phase?" | Read the struct definition |
| 3200-line god object | Split across 5 files, ~600 lines each |
