Skip to content

feat: per-secret snapshots with tiered retention for delegates#4034

Merged
sanity merged 4 commits into
mainfrom
feat/delegate-snapshots
May 13, 2026
Merged

feat: per-secret snapshots with tiered retention for delegates#4034
sanity merged 4 commits into
mainfrom
feat/delegate-snapshots

Conversation

@sanity
Copy link
Copy Markdown
Collaborator

@sanity sanity commented May 5, 2026

Problem

Delegate secrets had no defense against accidental overwrite or buggy delegate updates:

  • store_secret was last-write-wins with no history. A delegate that corrupted its own state had no recovery path; not even moments-ago data could be retrieved.
  • remove_secret had a real bug: it deleted the ciphertext file but left the ReDb secrets_index and the in-memory key_to_secret_part map pointing at the now-missing path. After removal, an index lookup would still claim the secret existed, while get_secret would return file-not-found. Stale entries accumulated forever and survived restarts.

These both showed up in a code review of how delegates store data and secrets (request from Ian, 2026-05-04). Issue #3050 covers the broader "vault delegate" multi-device sync design — this PR is the local-only durability slice that makes sense to ship without waiting on that design.

Solution

Snapshot-on-write (default-on). Before each store_secret overwrites an existing file, the prior ciphertext is renamed to {delegate_dir}/.snapshots/{secret_id}/{epoch_ms:020}. After the new write lands, the snapshot directory is thinned per a tiered retention policy:

  • Always keep the last 5 snapshots
  • Plus one per minute for the last 10 minutes
  • Plus one per hour for the last 24 hours
  • Plus one per day for the last week
  • Plus one per week for the last 4 weeks
  • Plus one per month for the last 12 months

Steady-state worst case ~62 snapshots per secret; under burst churn the policy collapses everything in the same slot to one entry. Rename (not hard-link) so File::create's truncate-in-place cannot corrupt the snapshot inode. Snapshots inherit the delegate's cipher; without the cipher key the bytes are useless. Best-effort: I/O failures during snapshot or thinning never block the primary write — they just log. FREENET_DISABLE_SECRET_SNAPSHOTS env override exists for ops who explicitly want the previous behavior.

remove_secret fix. Now updates the ReDb index, the in-memory map, and removes the snapshot history. The integration test (remove_secret_clears_index_and_snapshots) was verified against the original buggy body — it fails with ReDb index still contains removed secret hash, confirming the regression coverage is real.

The two changes are in one PR because the remove_secret regression test asserts on snapshot deletion as part of its post-conditions, so they can't land independently without test churn.

Testing

7 unit tests for the retention algorithm in secret_snapshots.rs cover empty input, last-N-unconditional retention, dense-history thinning to the expected slot count, burst-in-single-slot collapse, default-policy steady-state cap (525,600 minute-snapshots → ≤70 retained), future timestamps treated as age-zero, and zero-count buckets being inert.

6 integration tests in secrets_store.rs against a real on-disk SecretsStore:

  • second_write_snapshots_prior_value — second store_secret produces exactly one snapshot whose ciphertext decrypts back to the prior plaintext.
  • burst_writes_are_thinned — 50 rapid writes under a tight policy land in ≤4 snapshots.
  • remove_secret_clears_index_and_snapshots — verifies ReDb index cleared, in-memory map cleared, snapshot dir gone, get_secret returns MissingSecret. Confirmed to fail on the original buggy remove_secret body.
  • remove_nonexistent_secret_is_noop — boundary case.
  • first_write_creates_no_snapshot — boundary case.
  • store_and_load — pre-existing happy-path test, still passes.

cargo fmt clean. No new clippy warnings introduced (pre-existing warnings in tracing.rs and simulation_integration.rs are untouched). Full lib suite: 2546 passed, 0 failed.

Out of scope (follow-ups)

  • fdev delegate export/import CLI for off-device backups (requires archive format design and cipher handling — separate PR).
  • Manual rewording on freenet.org/build/manual/components/delegates/ around the "act as backups and replicas" claim until Delegates as durable vaults: multi-device sync and backup for private data #3050 ships (lives in the web/ repo).
  • Restore-from-snapshot CLI / API. Snapshots are preserved; user-facing recovery is currently manual file move.

[AI-assisted - Claude]

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 5, 2026

Rule Review: No blocking issues; two minor testing gaps

Rules checked: code-style.md, testing.md, contracts.md, git-workflow.md, AGENTS.md (cleanup/GC rule referenced by PR itself)
Files reviewed: 3 (wasm_runtime.rs, wasm_runtime/secret_snapshots.rs, wasm_runtime/secrets_store.rs)

Warnings

None.

Info

  • crates/core/src/wasm_runtime/secret_snapshots.rs (no single line) — thin_snapshots has no dedicated unit test. The select_keep algorithm is thoroughly tested, and burst_writes_are_thinned exercises the I/O path end-to-end, but three thin_snapshots-specific behaviours go untested: (1) non-regular-file entries in the snapshot dir are skipped without deletion, (2) fs::read_dir failure logs a warning and returns early, (3) fs::remove_file failure is logged but doesn't propagate. These are all best-effort log-and-continue paths, so failing them would not corrupt data — but covering them explicitly would confirm the non-deletion guarantee for directories/symlinks in the snapshot dir. (rule: testing.md — error paths)

  • crates/core/src/wasm_runtime/secrets_store.rs:638DISABLE_SNAPSHOTS_ENV env-var check is never exercised by any test. The disabled_flag_suppresses_snapshots test calls set_snapshots_enabled(false) which bypasses the std::env::var_os(...) branch entirely. A logic inversion in the condition (is_some() vs is_none()) would pass all tests. A one-liner test that sets the env var and constructs a SecretsStore would close this gap. (rule: testing.md — error/config paths)

  • crates/core/src/wasm_runtime/secret_snapshots.rs:83-86max_age: Option<Duration> is pub with None permitted and documented as "older snapshots may stick around indefinitely." The PR's own doc comment asserts AGENTS.md compliance ("every retained entry has a finite lifetime"), but that claim is only true when max_age: Some(...). Since all fields are pub, a caller constructing a custom RetentionPolicy { max_age: None, .. } would create an unbounded retention store that violates the AGENTS.md cleanup-exemption rule. In production this never happens (only default() is used), but the public API makes the footgun accessible. Consider a RetentionPolicy::new(…, max_age: Duration) constructor that enforces the invariant, or at minimum a #[must_use] note in the None documentation. (rule: AGENTS.md cleanup/GC rule)


Rule review against .claude/rules/. WARNING findings block merge.

sanity added a commit that referenced this pull request May 5, 2026
Multi-perspective review (code-first, testing, skeptical, big-picture,
Codex) converged on three behavioral concerns with the initial
snapshot-on-write design plus several test gaps. This commit addresses
all of them.

**Durability (P1, code-first + Codex + skeptical concur).** The prior
sequence was rename(active → snap) → File::create(active), which left
the active path missing for the duration of the new write. A crash or
I/O failure in that window would surface as `MissingSecret` for a
secret that existed on both sides of the operation — a regression
compared to the pre-PR truncate-in-place behavior. Reworked to:

  hard_link(active, snap_path)   # snapshot points at OLD inode
  File::create(active.tmp)       # NEW ciphertext goes to fresh inode
  write_all + sync_all           # fully written, durable
  rename(active.tmp, active)     # atomic replacement of dir entry

The active path is therefore never absent: callers see either the old
or new value, never neither. Falls back to fs::copy on filesystems
that don't support hard links (FAT, some network mounts).

**remove_secret consistency (code-first + Codex + skeptical concur).**
Previously, on ReDb persistence failure the in-memory map was still
updated, so a subsequent restart would reload the stale ReDb state
and resurrect the deleted entry — exactly the bug the PR was
supposed to fix. Now mirrors store_secret: ReDb error is fatal,
in-memory map is only updated after the persistent commit succeeds.

**Snapshot path safety (code-first + skeptical concur).** Hoisted the
collision-suffix walk into `next_snapshot_path` which returns
AlreadyExists after exhausting 1024 slots instead of silently
overwriting the 1024th snapshot. Tightened `parse_snapshot_stamp` to
accept only `{digits}` or `{digits}.{digits}` (so e.g. `42.tmp` is
not mistaken for a year-1970 snapshot). thin_snapshots now filters
to regular files via DirEntry::file_type, preventing spurious
`remove_file failed` warnings on directories that happen to have
all-digit names.

**Documentation.**
  - SecretsStore type-level rustdoc explicitly documents that the
    type is not internally synchronized; &mut self is the lock.
  - Comments at the SystemTime imports explain why TimeSource (the
    project-wide convention in crates/core/) is the wrong abstraction
    here — snapshot file names embed epoch_ms that must remain
    meaningful across restarts; TimeSource has no stable epoch.
  - in-memory map doc-comment notes the persistence-first invariant.

**New tests** filling gaps testing-reviewer flagged:
  - parse_snapshot_stamp_accepts_valid_shapes / _rejects_garbage
    (direct unit tests for the tightened parser)
  - next_snapshot_path_uses_unsuffixed_when_free / _falls_back_to_
    suffix_on_collision (locks the suffix-collision branch behavior)
  - disabled_flag_suppresses_snapshots (covers the env-var path via
    a test-only setter)
  - delegates_have_disjoint_snapshot_histories (locks per-delegate
    isolation when two delegates use the same SecretsId)
  - sleep(2ms) → 5ms with explanatory comment on slow-CI tolerance

Total new tests: 4 unit + 2 integration. Full lib suite: 2552 passed.

[AI-assisted - Claude]
@sanity
Copy link
Copy Markdown
Collaborator Author

sanity commented May 5, 2026

Review pass complete

Ran four internal review agents (code-first, testing, skeptical, big-picture) plus Codex CLI in parallel. Findings consolidated below; addressed in commit b03b1db.

Convergent findings (multiple reviewers concurred)

# Finding Severity Status
1 Rename-before-write left active path missing on crash/IO failure (regressed pre-PR durability) P1 Fixed: hard_link to snapshot + write-tmp + atomic rename
2 remove_secret swallowed ReDb errors → in-mem updated even on persist failure → restart resurrects deleted entry P1 Fixed: ReDb error now fatal, mirroring store_secret
3 Suffix-loop silently overwrote .1023 after 1024 collisions P2 Fixed: returns AlreadyExists; caller logs
4 parse_snapshot_stamp accepted 42.tmp and similar; thin_snapshots didn't filter directories P2 Fixed: tightened parser to digits-only; filters via DirEntry::file_type

Test gaps closed

  • FREENET_DISABLE_SECRET_SNAPSHOTS opt-out path now covered (disabled_flag_suppresses_snapshots)
  • Collision-suffix branch covered (next_snapshot_path_falls_back_to_suffix_on_collision)
  • Multi-delegate isolation covered (delegates_have_disjoint_snapshot_histories)
  • parse_snapshot_stamp direct unit tests for valid/garbage shapes
  • Load-bearing sleep(2ms) bumped to 5ms with explanatory comment

Total tests: 11 unit + 8 integration (was 7 + 6). Full lib suite: 2552 passing.

Documentation

  • SecretsStore rustdoc explicitly documents that the type is not internally synchronized; &mut self is the lock. (Skeptical-reviewer flagged this as a footgun for future agents wrapping it in interior mutability.)
  • Comments at the SystemTime imports explain why TimeSource (the project-wide convention in crates/core/) is the wrong abstraction here — snapshot file names embed epoch_ms that must remain meaningful across restarts; TimeSource has no stable epoch. (Big-picture-reviewer flagged this as a code-style.md tension.)

Deferred work — now tracked as issues

Per finish-the-fix.md, follow-ups are filed as issues rather than relying on PR-description visibility:

Findings not addressed (with rationale)

  • NTP backwards step (skeptical, minor): real but extreme edge case; would require monotonic-counter component in filenames. Filing as issue would be premature optimization.
  • Cipher rotation invalidates old snapshots (skeptical, minor): documented behavior — snapshots inherit whatever cipher was active at write time. A future "embed cipher fingerprint in snapshot" change is part of the export-bundle work in feat(delegates): fdev CLI to export/import a delegate's secrets bundle #4035.
  • keep_last: 0 + empty buckets footgun (skeptical, blocker-flagged): RetentionPolicy is a public struct; constructing one is opt-in and tested behavior (zero_max_count_bucket_is_inert). The default is sensible. A validating builder is overkill for a struct used internally.

[AI-assisted - Claude]

sanity added 4 commits May 12, 2026 09:21
Delegate secrets had no defense against accidental overwrite or buggy
delegate updates: store_secret was last-write-wins with no history. If
a delegate corrupted its own state, the prior value was permanently
gone, and there was no way for users to recover even moments-ago data.

Two changes here, in one PR because the regression test for the bug
relies on the new snapshot infrastructure to make a meaningful
end-to-end assertion:

1. **Snapshot-on-write** (default-on). Before each store_secret
   overwrites an existing file, it renames the prior ciphertext into
   {delegate_dir}/.snapshots/{secret_id}/{epoch_ms}. After the new
   write lands, the snapshot directory is thinned per a tiered
   retention policy: keep last 5 + one per minute / hour / day / week
   / month, capping steady-state retention at ~62 entries per secret.
   Rename (not hard-link) so File::create's truncate-in-place can't
   corrupt the snapshot inode. Snapshots inherit the delegate's
   cipher; without the cipher key the bytes are useless.

   Best-effort: I/O failures during snapshot or thinning never block
   the primary write; they just log. FREENET_DISABLE_SECRET_SNAPSHOTS
   env override exists for ops with extreme disk pressure.

2. **remove_secret bug fix.** The old implementation deleted the file
   only and left both the ReDb secrets index and the in-memory
   key_to_secret_part map pointing at the now-missing path. After
   removal, has_secret() (index-driven) returned true while
   get_secret() failed file-not-found - a silent index leak that
   accumulated forever and survived restarts. Fix updates both
   indices and removes the snapshot history.

Tests: 7 unit tests for the retention algorithm (empty / dense burst /
steady-state cap / future timestamps / zero-count bucket boundaries),
plus 6 integration tests against a real on-disk SecretsStore covering
the snapshot round-trip with cipher decryption, burst-write thinning,
remove_secret index cleanup (regression test verified to fail against
the original buggy body), no-snapshot-on-first-write, and no-op
removal of a never-written secret.

[AI-assisted - Claude]
Multi-perspective review (code-first, testing, skeptical, big-picture,
Codex) converged on three behavioral concerns with the initial
snapshot-on-write design plus several test gaps. This commit addresses
all of them.

**Durability (P1, code-first + Codex + skeptical concur).** The prior
sequence was rename(active → snap) → File::create(active), which left
the active path missing for the duration of the new write. A crash or
I/O failure in that window would surface as `MissingSecret` for a
secret that existed on both sides of the operation — a regression
compared to the pre-PR truncate-in-place behavior. Reworked to:

  hard_link(active, snap_path)   # snapshot points at OLD inode
  File::create(active.tmp)       # NEW ciphertext goes to fresh inode
  write_all + sync_all           # fully written, durable
  rename(active.tmp, active)     # atomic replacement of dir entry

The active path is therefore never absent: callers see either the old
or new value, never neither. Falls back to fs::copy on filesystems
that don't support hard links (FAT, some network mounts).

**remove_secret consistency (code-first + Codex + skeptical concur).**
Previously, on ReDb persistence failure the in-memory map was still
updated, so a subsequent restart would reload the stale ReDb state
and resurrect the deleted entry — exactly the bug the PR was
supposed to fix. Now mirrors store_secret: ReDb error is fatal,
in-memory map is only updated after the persistent commit succeeds.

**Snapshot path safety (code-first + skeptical concur).** Hoisted the
collision-suffix walk into `next_snapshot_path` which returns
AlreadyExists after exhausting 1024 slots instead of silently
overwriting the 1024th snapshot. Tightened `parse_snapshot_stamp` to
accept only `{digits}` or `{digits}.{digits}` (so e.g. `42.tmp` is
not mistaken for a year-1970 snapshot). thin_snapshots now filters
to regular files via DirEntry::file_type, preventing spurious
`remove_file failed` warnings on directories that happen to have
all-digit names.

**Documentation.**
  - SecretsStore type-level rustdoc explicitly documents that the
    type is not internally synchronized; &mut self is the lock.
  - Comments at the SystemTime imports explain why TimeSource (the
    project-wide convention in crates/core/) is the wrong abstraction
    here — snapshot file names embed epoch_ms that must remain
    meaningful across restarts; TimeSource has no stable epoch.
  - in-memory map doc-comment notes the persistence-first invariant.

**New tests** filling gaps testing-reviewer flagged:
  - parse_snapshot_stamp_accepts_valid_shapes / _rejects_garbage
    (direct unit tests for the tightened parser)
  - next_snapshot_path_uses_unsuffixed_when_free / _falls_back_to_
    suffix_on_collision (locks the suffix-collision branch behavior)
  - disabled_flag_suppresses_snapshots (covers the env-var path via
    a test-only setter)
  - delegates_have_disjoint_snapshot_histories (locks per-delegate
    isolation when two delegates use the same SecretsId)
  - sleep(2ms) → 5ms with explanatory comment on slow-CI tolerance

Total new tests: 4 unit + 2 integration. Full lib suite: 2552 passed.

[AI-assisted - Claude]
Addresses the claude-rule-review bot finding on PR #4034: the
`keep_last` clause was a permanent exemption from snapshot GC with no
TTL or absolute-age override. Per AGENTS.md "WHEN writing cleanup/GC
logic": cleanup exemptions MUST be time-bounded.

`RetentionPolicy` now carries `max_age: Option<Duration>`, applied as a
final filter in `select_keep`: any otherwise-selected entry older than
`now - max_age` is dropped. Default is `Some(2 years)` — clear headroom
above the month bucket (12 months) so the bucket retention is
unaffected, while stale ciphertext from secrets the user stopped
writing eventually ages out instead of lingering forever. `None`
preserves the old behavior for callers that want it.

Also reorders `secret_snapshots.rs` per code-style.md
(`imports → types → trait impls → functions`): `RetentionBucket`,
`RetentionPolicy`, and their impls now precede `snapshot_dir_for`,
`next_snapshot_path`, `thin_snapshots`, `parse_snapshot_stamp`. The
bot flagged the prior ordering as a style violation.

New tests:
  - `max_age_overrides_keep_last`: stale entries selected by
    `keep_last` are dropped — the load-bearing invariant for the
    cleanup-exemption rule.
  - `max_age_preserves_future_timestamps`: clock-skew safety,
    `duration_since` returning Err keeps the entry.
  - `max_age_drops_only_stale_entries`: mixed-age input behaves
    correctly under `keep_last: 10, max_age: 120s`.

[AI-assisted - Claude]
The new lint level (rust-1.93.0) flags `let _ = result;` on
`#[must_use]` types. Rebase onto current main pulled it in.

The tmp-file cleanup on rename failure is intentionally best-effort
(a stale tmp left behind is harmless debris), so handle the error
explicitly and log at debug level rather than discarding silently.

[AI-assisted - Claude]
@sanity sanity force-pushed the feat/delegate-snapshots branch from b03b1db to e6023b0 Compare May 12, 2026 14:48
@sanity
Copy link
Copy Markdown
Collaborator Author

sanity commented May 12, 2026

Update pass

Rebased onto current main (48 commits) and addressed the outstanding feedback.

Bot warning

The claude-rule-review flag on RetentionPolicy::keep_last exempting snapshots from GC indefinitely is fixed:

  • Added max_age: Option<Duration> to RetentionPolicy, applied as a final filter in select_keep so it overrides every other clause (including keep_last).
  • Default is Some(2 years) — clear headroom above the month bucket (12 months) so the default tier retention is unaffected, while stale ciphertext from secrets the user stopped writing eventually ages out.
  • None preserves the prior unbounded behavior for callers that explicitly want it.

3 new unit tests pin the load-bearing invariants:

  • max_age_overrides_keep_last — stale entries selected by keep_last are dropped (the regression-proof for the cleanup-exemption rule).
  • max_age_preserves_future_timestamps — clock-skew safety.
  • max_age_drops_only_stale_entries — mixed-age input behaves correctly.

Style note

Reordered secret_snapshots.rs per code-style.md (imports → types → trait impls → functions): RetentionBucket, RetentionPolicy, and their impls now precede the standalone snapshot helpers.

Rebase

Resolved one new clippy lint that arrived with the rebase (clippy::let_underscore_must_use, new in rust-1.93.0) on the tmp-file cleanup path. Handled explicitly with a debug-level log instead of let _ = ....

Tests

  • All 14 snapshot algorithm unit tests pass (was 11).
  • All 8 secrets_store integration tests pass.
  • Full lib suite 2510 tests, 0 failed.
  • cargo clippy -p freenet --lib --tests clean (unrelated --all-features errors in tracing.rs are pre-existing on main, not from this PR).

Branch is now e6023b04, three commits: original feature, review-fix pass, and this max_age + rebase pass.

[AI-assisted - Claude]

@sanity sanity added this pull request to the merge queue May 13, 2026
Merged via the queue into main with commit a998338 May 13, 2026
13 checks passed
@sanity sanity deleted the feat/delegate-snapshots branch May 13, 2026 15:31
sanity added a commit that referenced this pull request May 13, 2026
…crets

PR #4034 added per-secret snapshot-on-write history but left no way to
actually use it — users would have had to manually move files in
`.snapshots/{secret_id}/` over the active path. This adds the read and
restore APIs that the upcoming `fdev delegate snapshots` CLI (issue
#4036) will wrap.

API surface:

- `SecretsStore::list_snapshots(delegate, key) -> Vec<SnapshotMetadata>`:
  enumerates snapshot history oldest-first. Empty Vec for a never-written
  secret (not an error). Returns timestamp_ms + suffix (for same-ms
  collisions) + path + size_bytes so callers can display history without
  needing to decrypt.

- `SecretsStore::restore_snapshot(delegate, key, timestamp_ms)`: promotes
  a captured snapshot back to the active path through the same
  hard-link + atomic-rename discipline as `store_secret`, taking a fresh
  snapshot of the value being replaced first so the restore is itself
  reversible. Re-adds the secret to the ReDb index + in-memory map if a
  prior `remove_secret` had cleared them.

Doesn't decrypt on restore: the snapshot is ciphertext written by the
same delegate's cipher, so it remains decryptable by `get_secret` without
any re-keying. Means restore is usable even when the cipher hasn't been
registered yet (recovery scenario).

`SecretStoreError::SnapshotNotFound { key, timestamp_ms }` is the new
typed error variant so the CLI can render a precise message instead of
a generic IO failure.

CLI work split off per the issue's own guidance ("User-facing recovery
UX has its own design questions ... worth its own review cycle"); see
follow-up issue #4036. Re-exports `SnapshotMetadata` + `SecretStoreError`
through `freenet::dev_tool` so the CLI consumer can land as a stacked PR.

Tests: 5 new SecretsStore-level tests cover empty list, multi-write
history ordering, happy-path restore + reversibility snapshot,
SnapshotNotFound on unknown timestamp, and restore-after-remove
re-populating the index. 4 new secret_snapshots tests cover the new
parse_snapshot_name helper, list_snapshots sort order (including the
None-before-Some(0) edge case that initially failed and exposed a real
sort-key bug), collision-suffix disambiguation, and missing-dir
returning empty.

Refs: #4036

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant