feat: per-secret snapshots with tiered retention for delegates#4034
Conversation
Rule Review: No blocking issues; two minor testing gapsRules checked: WarningsNone. Info
Rule review against |
Multi-perspective review (code-first, testing, skeptical, big-picture,
Codex) converged on three behavioral concerns with the initial
snapshot-on-write design plus several test gaps. This commit addresses
all of them.
**Durability (P1, code-first + Codex + skeptical concur).** The prior
sequence was rename(active → snap) → File::create(active), which left
the active path missing for the duration of the new write. A crash or
I/O failure in that window would surface as `MissingSecret` for a
secret that existed on both sides of the operation — a regression
compared to the pre-PR truncate-in-place behavior. Reworked to:
hard_link(active, snap_path) # snapshot points at OLD inode
File::create(active.tmp) # NEW ciphertext goes to fresh inode
write_all + sync_all # fully written, durable
rename(active.tmp, active) # atomic replacement of dir entry
The active path is therefore never absent: callers see either the old
or new value, never neither. Falls back to fs::copy on filesystems
that don't support hard links (FAT, some network mounts).
**remove_secret consistency (code-first + Codex + skeptical concur).**
Previously, on ReDb persistence failure the in-memory map was still
updated, so a subsequent restart would reload the stale ReDb state
and resurrect the deleted entry — exactly the bug the PR was
supposed to fix. Now mirrors store_secret: ReDb error is fatal,
in-memory map is only updated after the persistent commit succeeds.
**Snapshot path safety (code-first + skeptical concur).** Hoisted the
collision-suffix walk into `next_snapshot_path` which returns
AlreadyExists after exhausting 1024 slots instead of silently
overwriting the 1024th snapshot. Tightened `parse_snapshot_stamp` to
accept only `{digits}` or `{digits}.{digits}` (so e.g. `42.tmp` is
not mistaken for a year-1970 snapshot). thin_snapshots now filters
to regular files via DirEntry::file_type, preventing spurious
`remove_file failed` warnings on directories that happen to have
all-digit names.
**Documentation.**
- SecretsStore type-level rustdoc explicitly documents that the
type is not internally synchronized; &mut self is the lock.
- Comments at the SystemTime imports explain why TimeSource (the
project-wide convention in crates/core/) is the wrong abstraction
here — snapshot file names embed epoch_ms that must remain
meaningful across restarts; TimeSource has no stable epoch.
- in-memory map doc-comment notes the persistence-first invariant.
**New tests** filling gaps testing-reviewer flagged:
- parse_snapshot_stamp_accepts_valid_shapes / _rejects_garbage
(direct unit tests for the tightened parser)
- next_snapshot_path_uses_unsuffixed_when_free / _falls_back_to_
suffix_on_collision (locks the suffix-collision branch behavior)
- disabled_flag_suppresses_snapshots (covers the env-var path via
a test-only setter)
- delegates_have_disjoint_snapshot_histories (locks per-delegate
isolation when two delegates use the same SecretsId)
- sleep(2ms) → 5ms with explanatory comment on slow-CI tolerance
Total new tests: 4 unit + 2 integration. Full lib suite: 2552 passed.
[AI-assisted - Claude]
Review pass completeRan four internal review agents (code-first, testing, skeptical, big-picture) plus Codex CLI in parallel. Findings consolidated below; addressed in commit b03b1db. Convergent findings (multiple reviewers concurred)
Test gaps closed
Total tests: 11 unit + 8 integration (was 7 + 6). Full lib suite: 2552 passing. Documentation
Deferred work — now tracked as issuesPer
Findings not addressed (with rationale)
[AI-assisted - Claude] |
Delegate secrets had no defense against accidental overwrite or buggy
delegate updates: store_secret was last-write-wins with no history. If
a delegate corrupted its own state, the prior value was permanently
gone, and there was no way for users to recover even moments-ago data.
Two changes here, in one PR because the regression test for the bug
relies on the new snapshot infrastructure to make a meaningful
end-to-end assertion:
1. **Snapshot-on-write** (default-on). Before each store_secret
overwrites an existing file, it renames the prior ciphertext into
{delegate_dir}/.snapshots/{secret_id}/{epoch_ms}. After the new
write lands, the snapshot directory is thinned per a tiered
retention policy: keep last 5 + one per minute / hour / day / week
/ month, capping steady-state retention at ~62 entries per secret.
Rename (not hard-link) so File::create's truncate-in-place can't
corrupt the snapshot inode. Snapshots inherit the delegate's
cipher; without the cipher key the bytes are useless.
Best-effort: I/O failures during snapshot or thinning never block
the primary write; they just log. FREENET_DISABLE_SECRET_SNAPSHOTS
env override exists for ops with extreme disk pressure.
2. **remove_secret bug fix.** The old implementation deleted the file
only and left both the ReDb secrets index and the in-memory
key_to_secret_part map pointing at the now-missing path. After
removal, has_secret() (index-driven) returned true while
get_secret() failed file-not-found - a silent index leak that
accumulated forever and survived restarts. Fix updates both
indices and removes the snapshot history.
Tests: 7 unit tests for the retention algorithm (empty / dense burst /
steady-state cap / future timestamps / zero-count bucket boundaries),
plus 6 integration tests against a real on-disk SecretsStore covering
the snapshot round-trip with cipher decryption, burst-write thinning,
remove_secret index cleanup (regression test verified to fail against
the original buggy body), no-snapshot-on-first-write, and no-op
removal of a never-written secret.
[AI-assisted - Claude]
Multi-perspective review (code-first, testing, skeptical, big-picture,
Codex) converged on three behavioral concerns with the initial
snapshot-on-write design plus several test gaps. This commit addresses
all of them.
**Durability (P1, code-first + Codex + skeptical concur).** The prior
sequence was rename(active → snap) → File::create(active), which left
the active path missing for the duration of the new write. A crash or
I/O failure in that window would surface as `MissingSecret` for a
secret that existed on both sides of the operation — a regression
compared to the pre-PR truncate-in-place behavior. Reworked to:
hard_link(active, snap_path) # snapshot points at OLD inode
File::create(active.tmp) # NEW ciphertext goes to fresh inode
write_all + sync_all # fully written, durable
rename(active.tmp, active) # atomic replacement of dir entry
The active path is therefore never absent: callers see either the old
or new value, never neither. Falls back to fs::copy on filesystems
that don't support hard links (FAT, some network mounts).
**remove_secret consistency (code-first + Codex + skeptical concur).**
Previously, on ReDb persistence failure the in-memory map was still
updated, so a subsequent restart would reload the stale ReDb state
and resurrect the deleted entry — exactly the bug the PR was
supposed to fix. Now mirrors store_secret: ReDb error is fatal,
in-memory map is only updated after the persistent commit succeeds.
**Snapshot path safety (code-first + skeptical concur).** Hoisted the
collision-suffix walk into `next_snapshot_path` which returns
AlreadyExists after exhausting 1024 slots instead of silently
overwriting the 1024th snapshot. Tightened `parse_snapshot_stamp` to
accept only `{digits}` or `{digits}.{digits}` (so e.g. `42.tmp` is
not mistaken for a year-1970 snapshot). thin_snapshots now filters
to regular files via DirEntry::file_type, preventing spurious
`remove_file failed` warnings on directories that happen to have
all-digit names.
**Documentation.**
- SecretsStore type-level rustdoc explicitly documents that the
type is not internally synchronized; &mut self is the lock.
- Comments at the SystemTime imports explain why TimeSource (the
project-wide convention in crates/core/) is the wrong abstraction
here — snapshot file names embed epoch_ms that must remain
meaningful across restarts; TimeSource has no stable epoch.
- in-memory map doc-comment notes the persistence-first invariant.
**New tests** filling gaps testing-reviewer flagged:
- parse_snapshot_stamp_accepts_valid_shapes / _rejects_garbage
(direct unit tests for the tightened parser)
- next_snapshot_path_uses_unsuffixed_when_free / _falls_back_to_
suffix_on_collision (locks the suffix-collision branch behavior)
- disabled_flag_suppresses_snapshots (covers the env-var path via
a test-only setter)
- delegates_have_disjoint_snapshot_histories (locks per-delegate
isolation when two delegates use the same SecretsId)
- sleep(2ms) → 5ms with explanatory comment on slow-CI tolerance
Total new tests: 4 unit + 2 integration. Full lib suite: 2552 passed.
[AI-assisted - Claude]
Addresses the claude-rule-review bot finding on PR #4034: the `keep_last` clause was a permanent exemption from snapshot GC with no TTL or absolute-age override. Per AGENTS.md "WHEN writing cleanup/GC logic": cleanup exemptions MUST be time-bounded. `RetentionPolicy` now carries `max_age: Option<Duration>`, applied as a final filter in `select_keep`: any otherwise-selected entry older than `now - max_age` is dropped. Default is `Some(2 years)` — clear headroom above the month bucket (12 months) so the bucket retention is unaffected, while stale ciphertext from secrets the user stopped writing eventually ages out instead of lingering forever. `None` preserves the old behavior for callers that want it. Also reorders `secret_snapshots.rs` per code-style.md (`imports → types → trait impls → functions`): `RetentionBucket`, `RetentionPolicy`, and their impls now precede `snapshot_dir_for`, `next_snapshot_path`, `thin_snapshots`, `parse_snapshot_stamp`. The bot flagged the prior ordering as a style violation. New tests: - `max_age_overrides_keep_last`: stale entries selected by `keep_last` are dropped — the load-bearing invariant for the cleanup-exemption rule. - `max_age_preserves_future_timestamps`: clock-skew safety, `duration_since` returning Err keeps the entry. - `max_age_drops_only_stale_entries`: mixed-age input behaves correctly under `keep_last: 10, max_age: 120s`. [AI-assisted - Claude]
The new lint level (rust-1.93.0) flags `let _ = result;` on `#[must_use]` types. Rebase onto current main pulled it in. The tmp-file cleanup on rename failure is intentionally best-effort (a stale tmp left behind is harmless debris), so handle the error explicitly and log at debug level rather than discarding silently. [AI-assisted - Claude]
b03b1db to
e6023b0
Compare
Update passRebased onto current main (48 commits) and addressed the outstanding feedback. Bot warningThe
3 new unit tests pin the load-bearing invariants:
Style noteReordered RebaseResolved one new clippy lint that arrived with the rebase ( Tests
Branch is now [AI-assisted - Claude] |
…crets PR #4034 added per-secret snapshot-on-write history but left no way to actually use it — users would have had to manually move files in `.snapshots/{secret_id}/` over the active path. This adds the read and restore APIs that the upcoming `fdev delegate snapshots` CLI (issue #4036) will wrap. API surface: - `SecretsStore::list_snapshots(delegate, key) -> Vec<SnapshotMetadata>`: enumerates snapshot history oldest-first. Empty Vec for a never-written secret (not an error). Returns timestamp_ms + suffix (for same-ms collisions) + path + size_bytes so callers can display history without needing to decrypt. - `SecretsStore::restore_snapshot(delegate, key, timestamp_ms)`: promotes a captured snapshot back to the active path through the same hard-link + atomic-rename discipline as `store_secret`, taking a fresh snapshot of the value being replaced first so the restore is itself reversible. Re-adds the secret to the ReDb index + in-memory map if a prior `remove_secret` had cleared them. Doesn't decrypt on restore: the snapshot is ciphertext written by the same delegate's cipher, so it remains decryptable by `get_secret` without any re-keying. Means restore is usable even when the cipher hasn't been registered yet (recovery scenario). `SecretStoreError::SnapshotNotFound { key, timestamp_ms }` is the new typed error variant so the CLI can render a precise message instead of a generic IO failure. CLI work split off per the issue's own guidance ("User-facing recovery UX has its own design questions ... worth its own review cycle"); see follow-up issue #4036. Re-exports `SnapshotMetadata` + `SecretStoreError` through `freenet::dev_tool` so the CLI consumer can land as a stacked PR. Tests: 5 new SecretsStore-level tests cover empty list, multi-write history ordering, happy-path restore + reversibility snapshot, SnapshotNotFound on unknown timestamp, and restore-after-remove re-populating the index. 4 new secret_snapshots tests cover the new parse_snapshot_name helper, list_snapshots sort order (including the None-before-Some(0) edge case that initially failed and exposed a real sort-key bug), collision-suffix disambiguation, and missing-dir returning empty. Refs: #4036 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Problem
Delegate secrets had no defense against accidental overwrite or buggy delegate updates:
store_secretwas last-write-wins with no history. A delegate that corrupted its own state had no recovery path; not even moments-ago data could be retrieved.remove_secrethad a real bug: it deleted the ciphertext file but left the ReDbsecrets_indexand the in-memorykey_to_secret_partmap pointing at the now-missing path. After removal, an index lookup would still claim the secret existed, whileget_secretwould return file-not-found. Stale entries accumulated forever and survived restarts.These both showed up in a code review of how delegates store data and secrets (request from Ian, 2026-05-04). Issue #3050 covers the broader "vault delegate" multi-device sync design — this PR is the local-only durability slice that makes sense to ship without waiting on that design.
Solution
Snapshot-on-write (default-on). Before each
store_secretoverwrites an existing file, the prior ciphertext is renamed to{delegate_dir}/.snapshots/{secret_id}/{epoch_ms:020}. After the new write lands, the snapshot directory is thinned per a tiered retention policy:Steady-state worst case ~62 snapshots per secret; under burst churn the policy collapses everything in the same slot to one entry. Rename (not hard-link) so
File::create's truncate-in-place cannot corrupt the snapshot inode. Snapshots inherit the delegate's cipher; without the cipher key the bytes are useless. Best-effort: I/O failures during snapshot or thinning never block the primary write — they just log.FREENET_DISABLE_SECRET_SNAPSHOTSenv override exists for ops who explicitly want the previous behavior.remove_secretfix. Now updates the ReDb index, the in-memory map, and removes the snapshot history. The integration test (remove_secret_clears_index_and_snapshots) was verified against the original buggy body — it fails withReDb index still contains removed secret hash, confirming the regression coverage is real.The two changes are in one PR because the
remove_secretregression test asserts on snapshot deletion as part of its post-conditions, so they can't land independently without test churn.Testing
7 unit tests for the retention algorithm in
secret_snapshots.rscover empty input, last-N-unconditional retention, dense-history thinning to the expected slot count, burst-in-single-slot collapse, default-policy steady-state cap (525,600 minute-snapshots → ≤70 retained), future timestamps treated as age-zero, and zero-count buckets being inert.6 integration tests in
secrets_store.rsagainst a real on-diskSecretsStore:second_write_snapshots_prior_value— secondstore_secretproduces exactly one snapshot whose ciphertext decrypts back to the prior plaintext.burst_writes_are_thinned— 50 rapid writes under a tight policy land in ≤4 snapshots.remove_secret_clears_index_and_snapshots— verifies ReDb index cleared, in-memory map cleared, snapshot dir gone,get_secretreturnsMissingSecret. Confirmed to fail on the original buggyremove_secretbody.remove_nonexistent_secret_is_noop— boundary case.first_write_creates_no_snapshot— boundary case.store_and_load— pre-existing happy-path test, still passes.cargo fmtclean. No new clippy warnings introduced (pre-existing warnings intracing.rsandsimulation_integration.rsare untouched). Full lib suite: 2546 passed, 0 failed.Out of scope (follow-ups)
fdev delegate export/importCLI for off-device backups (requires archive format design and cipher handling — separate PR).web/repo).[AI-assisted - Claude]