feat: freenet-release-agent (Phase 1, dry-run only)#4082
Conversation
|
Now I have enough context to write the review. Rule Review: Clean with minor style notesRules checked: WarningsNone. The diff is clean against the blocking criteria:
Info
Rule review against |
Tier 1 (production-safety footguns): - Config::dry_run now defaults to true (was false). Missing or partial config can never silently enable live mode. - Config gains #[serde(deny_unknown_fields)]. A typo like `dry-run = false` (with a dash) is now rejected at load time instead of being silently dropped, which combined with the previous default would have flipped a gateway live on misconfig. - Rate-limit window is no longer consumed before the GitHub `latest` cross-check. A transient GitHub 5xx, an immediate sudo failure, or any other pre-spawn error returns the appropriate non-2xx without burning the legitimate retry's cooldown. Bookkeeping is now done under a single mutex held across spawn and only commits on success. - Updater::run waits up to 1s after spawn for an immediate non-zero exit (sudoers misconfig, sudo rejection, script-not-executable). Surfaces as 500 to the caller instead of a false-positive 202 that would also rate-limit the next retry. - Signed version is now pinned through to the script via a new --target-version flag on gateway-auto-update.sh. Closes the race where the agent's GitHub check and the script's own fetch could see different "latest" values if a new release was published between them. - check_clock_skew rejects non-positive issued_at and uses checked arithmetic, so i64::MIN no longer panics in `.abs()`. - POST /update is capped at 4 KiB via DefaultBodyLimit. Legit body is ~60 bytes. Tier 2 (cheap quality wins): - HMAC secret is always parsed as hex (matches install.sh's `openssl rand -hex 32`). Dropped the hex-vs-raw heuristic that could silently halve a raw 32-byte key. - HEADER_SIGNATURE is an axum HeaderName constant, not &str. - current_version is cached by binary mtime. Previously the agent forked freenet --version on every /version and /update — a cheap self-DoS vector. - Updater stdout/stderr inherit the agent's fds (journald under systemd) instead of /dev/null, so script failures show up in `journalctl -u freenet-release-agent`. - LatestSource trait + StaticLatest test double make the security-critical `target == GitHub latest` check actually testable. Previously the only way to exercise that branch was a real network call. - Removed dead `ReadWritePaths=/var/log/freenet-release-agent` from the systemd unit and the matching mkdir from install.sh; the agent only logs to journald. - README documents response semantics (200+no_op, 202, 401/403/429/ 502/500) so workflow callers can interpret responses correctly. - install.sh now refuses to install if /usr/local/bin/ gateway-auto-update.sh isn't owned by root or has perms > 0755 (otherwise the sudoers entry is meaningless). - Sudoers entry expanded to permit `--force --target-version *`, matching the new wire-up. - 11-test integration suite in tests/update_handler.rs covers HMAC, replay, version-pin, no-op, downgrade, rate-limit, /version, and /healthz end-to-end through the real axum router. Was 0 coverage before. Refs: #4073 PR: #4082
Multi-model review responsesFive reviewers (4 internal + Codex CLI) returned 30-ish findings across security, testing, and big-picture alignment. Consolidated punch list categorized into Tier 1 (must-fix in this PR), Tier 2 (cheap quality wins), Tier 3 (follow-ups tracked in #4103). Tier 1 — fixed in commit
|
| # | Finding | Source | Fix |
|---|---|---|---|
| 1 | Config::dry_run defaulted to false in code while example TOML said true. Partial config silently went live. |
big-picture | Defaults to true now |
| 2 | Config lacked deny_unknown_fields. dry-run = false (hyphen typo) silently used the default. |
code-first | #[serde(deny_unknown_fields)]; test pins the rejection |
| 3 | Rate-limit cooldown started before the GitHub check, so a transient 5xx burned the legitimate 10-min retry window. | Codex P1, skeptical CRIT #2, code-first | Moved bookkeeping to after spawn-success. Mutex held across spawn so the failure paths don't consume the window. |
| 4 | Updater returned Ok the instant sudo spawned; misconfigured sudoers looked like success while nothing happened. |
Codex P1, testing | 1s synchronous probe for immediate non-zero exit → surfaced as 500. |
| 5 | Signed version not pinned through to the script — race window between agent's GitHub check and script's own fetch. | Codex P2 | New --target-version flag on gateway-auto-update.sh; agent passes v{version}. Sudoers expanded to allow the new arg. |
| 6 | issued_at: i64::MIN overflowed .abs(). |
skeptical HIGH #3 | Reject non-positive issued_at; checked arithmetic. Tests pin both. |
| 7 | Body size unbounded. | skeptical MED #9, code-first | DefaultBodyLimit::max(4096). |
Tier 2 — also fixed in a41ce688
| # | Finding | Source | Fix |
|---|---|---|---|
| 8 | HMAC secret hex-vs-raw heuristic could silently halve a raw key whose bytes were all hex digits. | code-first | Always-hex; rejects non-hex with a clear error. |
| 9 | HEADER_SIGNATURE as &str not HeaderName. |
code-first | HeaderName::from_static. |
| 10 | current_version forked freenet --version on every request — self-DoS surface. |
skeptical MED #10 | VersionCache keyed on binary mtime; test asserts second call doesn't respawn. |
| 11 | Updater stdout/stderr → /dev/null, debugging blind spot. |
code-first | Stdio::inherit() so output lands in journald via the systemd unit. |
| 12 | Dead ReadWritePaths=/var/log/freenet-release-agent in systemd unit. |
code-first | Removed. install.sh also no longer creates the unused dir. |
| 13 | fetch_latest_version silently excludes pre-releases. |
code-first | Now documented on GitHubLatest. |
| 14 | no_op: true contract not documented for callers. |
code-first | Rustdoc on UpdateResponse + Response-semantics table in README. |
| 15 | No integration test of update_handler; LatestSource not mockable. |
testing | New LatestSource trait + StaticLatest test double. 11-test integration suite in tests/update_handler.rs covers HMAC / replay / version-pin / no-op / downgrade / rate-limit / version / healthz end-to-end. |
| Bonus | install.sh doesn't verify gateway-auto-update.sh privilege boundary. |
skeptical (script writable by freenet-update would void sudoers) |
install.sh now refuses to install if the script isn't owned by root or has perms > 0755. |
Tier 3 — tracked in #4103
These need design discussion or are larger than this PR:
- Persistent replay protection (skeptical CRIT Overall architecture RFC #1) — current ±5min skew + in-memory rate limit is replayable across agent restarts.
- Artifact signature verification (skeptical HIGH Peer resource usage balancing #4) — cosign/minisign on release tarballs; current trust model assumes GitHub
latestis integrity-authoritative. - GitHub
latestcaching (skeptical HIGH Intelligent routing #5) — anonymous API quota DoS vector. - Status-sentinel file (skeptical HIGH Implement join ring op #8) — last-update outcome readable post-restart.
- HMAC secret via
LoadCredentialEncrypted(skeptical HIGH Social credit #6) — group-readable file is a footgun. cross-compile.ymlentry for the agent (big-picture) — needed before Phase 3.- AGENTS.md + skill forward-pointer (big-picture) — docs.
Test counts
- Before: 14 unit tests in 4 modules.
- After: 22 unit tests + 11 integration tests (auth/HMAC overflow/non-positive issued_at, config deny_unknown_fields, version cache, full /update pipeline).
Verified after the fix round
cargo fmt --check, cargo clippy --all-targets -- -D warnings, and cargo test -p freenet-release-agent all clean.
[AI-assisted - Claude]
Adds two purely-additive companions to PR #4082 so the deploy/rollout steps don't live only in chat: - docs/release-agent-deployment.md captures the parts install.sh can't automate (DNS, AWS SG sizing for vega, secret rotation, dry-run acceptance checklist) so the next person installing the agent doesn't have to reconstruct them. - .github/workflows/gateway-update.yml is a workflow_dispatch-only draft of the Phase 2 rollout job. NOT triggered by release.yml or cross-compile.yml — it can only fire when a human clicks Run, and fails fast if the RELEASE_AGENT_HMAC_* secrets aren't set. Lets the workflow get reviewed in PR before any production wiring exists. Both are reversible (delete-and-forget) and touch zero production behavior. Phase 1 dry-run validation on nova is still the next real step; this PR is just text. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Pushed
Both reversible. Phase 1 dry-run on nova is still the next real-machine step — this commit is just text. [AI-assisted - Claude] |
Adds an HTTP service that lets a future GitHub Actions release workflow trigger gateway updates without an SSH key. Listens on loopback, verifies an HMAC-signed POST body, cross-checks the requested version against GitHub's `latest` release (so a leaked token cannot install an arbitrary version), then either logs the decision (dry-run) or shells out to the existing `gateway-auto-update.sh --force` via a narrow sudoers entry. Phase 1 ships only the agent + deploy artifacts; everything defaults to dry-run. The CI workflow that calls into the agent and the live deployment on nova/vega are tracked as later phases in #4073. Refs: #4073
Tier 1 (production-safety footguns): - Config::dry_run now defaults to true (was false). Missing or partial config can never silently enable live mode. - Config gains #[serde(deny_unknown_fields)]. A typo like `dry-run = false` (with a dash) is now rejected at load time instead of being silently dropped, which combined with the previous default would have flipped a gateway live on misconfig. - Rate-limit window is no longer consumed before the GitHub `latest` cross-check. A transient GitHub 5xx, an immediate sudo failure, or any other pre-spawn error returns the appropriate non-2xx without burning the legitimate retry's cooldown. Bookkeeping is now done under a single mutex held across spawn and only commits on success. - Updater::run waits up to 1s after spawn for an immediate non-zero exit (sudoers misconfig, sudo rejection, script-not-executable). Surfaces as 500 to the caller instead of a false-positive 202 that would also rate-limit the next retry. - Signed version is now pinned through to the script via a new --target-version flag on gateway-auto-update.sh. Closes the race where the agent's GitHub check and the script's own fetch could see different "latest" values if a new release was published between them. - check_clock_skew rejects non-positive issued_at and uses checked arithmetic, so i64::MIN no longer panics in `.abs()`. - POST /update is capped at 4 KiB via DefaultBodyLimit. Legit body is ~60 bytes. Tier 2 (cheap quality wins): - HMAC secret is always parsed as hex (matches install.sh's `openssl rand -hex 32`). Dropped the hex-vs-raw heuristic that could silently halve a raw 32-byte key. - HEADER_SIGNATURE is an axum HeaderName constant, not &str. - current_version is cached by binary mtime. Previously the agent forked freenet --version on every /version and /update — a cheap self-DoS vector. - Updater stdout/stderr inherit the agent's fds (journald under systemd) instead of /dev/null, so script failures show up in `journalctl -u freenet-release-agent`. - LatestSource trait + StaticLatest test double make the security-critical `target == GitHub latest` check actually testable. Previously the only way to exercise that branch was a real network call. - Removed dead `ReadWritePaths=/var/log/freenet-release-agent` from the systemd unit and the matching mkdir from install.sh; the agent only logs to journald. - README documents response semantics (200+no_op, 202, 401/403/429/ 502/500) so workflow callers can interpret responses correctly. - install.sh now refuses to install if /usr/local/bin/ gateway-auto-update.sh isn't owned by root or has perms > 0755 (otherwise the sudoers entry is meaningless). - Sudoers entry expanded to permit `--force --target-version *`, matching the new wire-up. - 11-test integration suite in tests/update_handler.rs covers HMAC, replay, version-pin, no-op, downgrade, rate-limit, /version, and /healthz end-to-end through the real axum router. Was 0 coverage before. Refs: #4073 PR: #4082
Adds two purely-additive companions to PR #4082 so the deploy/rollout steps don't live only in chat: - docs/release-agent-deployment.md captures the parts install.sh can't automate (DNS, AWS SG sizing for vega, secret rotation, dry-run acceptance checklist) so the next person installing the agent doesn't have to reconstruct them. - .github/workflows/gateway-update.yml is a workflow_dispatch-only draft of the Phase 2 rollout job. NOT triggered by release.yml or cross-compile.yml — it can only fire when a human clicks Run, and fails fast if the RELEASE_AGENT_HMAC_* secrets aren't set. Lets the workflow get reviewed in PR before any production wiring exists. Both are reversible (delete-and-forget) and touch zero production behavior. Phase 1 dry-run validation on nova is still the next real step; this PR is just text. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Main grew a second `reqwest` (0.13.3) via the recent opentelemetry 0.32
bumps. PR-merge CI tests the merged SHA, so it must qualify which
reqwest the `freenet-release-agent` lockfile entry depends on:
- "reqwest",
+ "reqwest 0.12.28",
Pure mechanical refresh; no behavior change.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Multi-model review (4 internal + Codex CLI) surfaced several findings. Tier 1 (blocking) and Tier 2 (high-risk) addressed here; Tier 3 stays in #4103. Production-safety: - systemd: ProtectSystem=strict blocked the sudo'd child from writing /usr/local/bin/freenet — Phase 2 would have failed on first real update. Switched to ProtectSystem=off and RestrictNamespaces=false with comments explaining why the privilege boundary is sudoers, not the systemd sandbox. (Codex P1) - gateway-update.yml: `max-parallel` was at job level instead of `strategy.max-parallel` — workflow was invalid YAML, couldn't be dispatched. (Codex P2) - server.rs: mutex acquire wrapped in tokio::time::timeout(5s) so a stuck spawn can't deadlock all future /update requests indefinitely. Returns 503 on timeout. (Skeptical H1) - sudoers.freenet-release-agent: comment claimed `*` matches "a single non-whitespace argument" — false per man 5 sudoers, `*` matches any characters including whitespace. Fixed comment to explain the safety comes from two independent semver regex checks (agent semver::parse, then bash regex in gateway-auto-update.sh), not sudoers alone. (Code-first + Skeptical H2) - gateway-auto-update.sh: added strict semver regex validation of --target-version after the `${2#v}` strip — defense in depth against the sudoers wildcard. (Code-first) - version.rs: VersionCache keyed on mtime alone would return stale data forever if the update script used `cp -p` (mtime preserved on rename). Added size to the cache key + a regression test that preserves mtime and changes size. (Skeptical M5, rule-review WARNING) - docs/release-agent-deployment.md: claimed "no in-memory cache" for the HMAC secret, but main.rs reads it once at startup. Fixed the rotation procedure to require `systemctl restart`. (Skeptical L4) - config.rs: clock_skew_tolerance_seconds changed i64 → u32 so a misconfigured negative value can't accidentally reject every request. (Skeptical L3) Test coverage: - spawn_failure_does_not_consume_rate_limit_window — pins the security contract that an immediate sudo failure (exit 1 within 1s probe) must NOT consume the rate-limit window. (Testing high-risk) - sudoers_argv_matches_allowlist — fake-sudo records the agent's argv and asserts it matches the sudoers entry's prefix. Future refactors that reorder flags will fail this test before reaching production. (Testing high-risk) - oversized_body_rejected — pins the 4KiB body limit. (Code-first) - cache_invalidates_when_binary_changes — regression test for the size-keyed cache fix above. - Added a build_live() variant on the test Harness that injects a fake-sudo via the new Updater.sudo_command field, so the spawn pipeline can be exercised without real root. Docs: - AGENTS.md: added crates/release-agent/ to Repository Structure. - scripts/release.sh: future-migration comment on trigger_gateway_updates pointing at .github/workflows/gateway-update.yml and #4073. - gateway-auto-update.sh: usage block documents --target-version. Workflow: - gateway-update.yml: added comment documenting the split-version risk of fail-fast + max-parallel:1 on partial-failure rollout. Test totals: 37 (23 unit + 14 integration). cargo fmt + clippy --all-targets -D warnings clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Pushed Tier 1 — production-safety (blocking)
Tier 2 — test coverage (high-risk gaps)
Tier 2 — doc drift
Test totals37 (23 unit + 14 integration). Not addressed in this PRTier 3 follow-ups in #4103: replay nonce store, artifact-signature verification, cross-compile entry, GitHub-latest caching, status sentinel,
Will append these to #4103. [AI-assisted - Claude] |
Problem
The current Freenet release pipeline (
scripts/release.sh) ssh's into nova and vega asianto invokegateway-auto-update.sh --force. That ties the release process to a single workstation's SSH key and gives runners shell on the gateway boxes if we move it to CI. We want a narrow, signed HTTP path instead — see #4073 for the full design.The peer-side context for why prompt gateway updates matter: peers don't poll GitHub for new releases on a timer. They detect a higher-versioned peer (in practice, the gateway), set a version-mismatch flag, and only then verify against GitHub before exiting with code 42 to be auto-updated. That makes the gateway the version oracle for the whole network — a release that's on crates.io but not on the gateways is invisible to peers (
crates/core/src/bin/commands/auto_update.rs:1-14).Solution
A small Rust HTTP service (
freenet-release-agent):freenet-updatesystem user127.0.0.1:9876; Caddy fronts it atupdate.<host>.locut.uswith TLSGET /versionreadsfreenet --versionPOST /updaterequires:X-Signatureheader)issued_atwithin ±5 minutes of agent clock (replay protection)latestrelease (defense in depth: a leaked token cannot install an arbitrary version)dry_run = falsethe agent shells out tosudo --non-interactive /usr/local/bin/gateway-auto-update.sh --force. sudoers grants NOPASSWD only for that exact command line.The agent intentionally does not contain any of the update logic itself — it's a thin shim around the existing
gateway-auto-update.sh. The systemd timer that polls GitHub every 10 minutes also remains in place as a fallback, so an agent failure degrades to a ~10-minute delay rather than a release-breaking event.Testing
v-prefixed / pre-release / garbage), TOML config parsing, and dry-run-does-not-execute.cargo fmt --checkclean,cargo clippy --all-targets -- -D warningsclean.Out of scope (deferred to later phases of #4073)
release.ymlworkflow that calls both agents + Matrix/River announcements moved to CIrelease.sh's SSH stage; thefreenet:releaseskill becomes "trigger workflow_dispatch and watch"Refs
#4073
[AI-assisted - Claude]