feat(producer): enforce planDir size cap (PLAN_TOO_LARGE) #814
Conversation
Force-pushed af0336e to 8074417.
Force-pushed 27a3d70 to 5f0239a.
miguel-heygen
left a comment
Review: planDir size cap (PLAN_TOO_LARGE)
Clean. The 2 GB cap is well-justified against Lambda's 10 GB /tmp budget. Check runs after workDir cleanup to avoid double-counting. Config override via planDirSizeLimitBytes is smart for both testing and tighter deployments. Error class carries structured fields for workflow adapters. Walker handles vanishing files and skips symlinks. Test coverage is thorough — 7 cases. No concerns. LGTM.
vanceingalls
left a comment
Defensive ceiling on planDir size. 2 GB cap is justified against Lambda's 10 GB /tmp budget, the constant is documented, the error class carries actionable fields, and the cap is configurable via DistributedRenderConfig.planDirSizeLimitBytes for tests + tighter-budget adapters. Small targeted PR done well.
Strengths
- plan.ts:42-69 — `PlanTooLargeError` carries `sizeBytes`, `limitBytes`, AND a message that names the in-process renderer escape hatch + the v1.5 slicing roadmap. Adopters who hit it get actionable next steps without grepping source.
- plan.ts:71-76 — `formatBytes` for human-readable units in the error message. Trivial helper, but the difference between `2147483648 B` and `2.00 GiB` in a Lambda log is real.
- planSizeCap.test.ts:170-213 — testing `measurePlanDirBytes` directly (empty dir, recursive sum, symlink-ignored baseline) AND end-to-end via `plan()` is the right coverage shape.
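For illustration, a formatter with the behavior praised above might look like this sketch (hypothetical stand-in; the PR's actual plan.ts:71-76 code is not reproduced here):

```typescript
// Binary (1024-based) units, matching the GiB-flavored cap constant.
const UNITS = ["B", "KiB", "MiB", "GiB", "TiB"];

function formatBytes(bytes: number): string {
  let value = bytes;
  let unit = 0;
  while (value >= 1024 && unit < UNITS.length - 1) {
    value /= 1024;
    unit += 1;
  }
  // Raw bytes stay integral; scaled values get two decimals.
  return unit === 0 ? `${value} B` : `${value.toFixed(2)} ${UNITS[unit]}`;
}
```

With this shape, `formatBytes(2 * 1024 * 1024 * 1024)` renders as `2.00 GiB` rather than `2147483648 B`.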
Findings
important — failed planDirs are not cleaned up on PLAN_TOO_LARGE (plan.ts:629-636)
```ts
const planDirBytes = measurePlanDirBytes(planDir);
if (planDirBytes > sizeLimitBytes) {
  throw new PlanTooLargeError(planDirBytes, sizeLimitBytes);
}
```

The cap fires AFTER `freezePlan` finishes — so the planDir is fully materialized on disk before the throw. On Lambda this is fine (the container teardown reclaims /tmp). On Cloud Run Jobs / Temporal workers / Kubernetes Jobs that persist disk across activity invocations, you leave a 2+ GB planDir on every PLAN_TOO_LARGE — workers fill their disk after enough banned plans and start failing healthy plans with ENOSPC.
Two fixes:
1. Wrap the cap check in a try/finally that runs `rmSync(planDir, { recursive: true, force: true })` on the throw path.
2. Pre-flight: estimate planDir size BEFORE the expensive stages (video extract is the dominant cost). Reject early when the estimate exceeds the cap. Better diagnostic for the user (faster fail) AND no leaked planDir.
I'd take (1) as the immediate fix and queue (2) as a follow-up — (1) is mechanical, (2) requires estimating extract-stage output size before running it.
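Fix (1) might look like the following sketch. The error class here is a stand-in mirroring the fields described in this review, and `enforcePlanDirCap` is a hypothetical name — not the PR's actual code:

```typescript
import { rmSync } from "node:fs";

// Stand-in mirroring the fields the review describes; not plan.ts's class.
class PlanTooLargeError extends Error {
  readonly code = "PLAN_TOO_LARGE";
  constructor(
    readonly sizeBytes: number,
    readonly limitBytes: number,
  ) {
    super(`planDir is ${sizeBytes} bytes, over the ${limitBytes}-byte cap`);
  }
}

// Sketch of fix (1): the cap check removes the fully materialized planDir
// on the throw path, so long-lived workers (Cloud Run Jobs, Temporal,
// k8s Jobs) don't accumulate banned plans until ENOSPC.
function enforcePlanDirCap(
  planDir: string,
  planDirBytes: number,
  sizeLimitBytes: number,
): void {
  if (planDirBytes <= sizeLimitBytes) return;
  try {
    throw new PlanTooLargeError(planDirBytes, sizeLimitBytes);
  } finally {
    // force: true tolerates a planDir that already vanished.
    rmSync(planDir, { recursive: true, force: true });
  }
}
```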
important — the cap check runs AFTER the work tree cleanup, missing the worst-case peak (plan.ts:619-636)
The comment at line 631 says "The check runs AFTER cleanup so the workDir tree doesn't double-count." That's correct for the planDir-size cap, but the peak disk usage during plan() is planDir + .plan-work simultaneously — and that peak is what actually has to coexist with capture/encode in Lambda's 10 GB /tmp. Example: a 1.9 GB planDir that passes the cap can sit alongside a 1.9 GB transient .plan-work during the rename pass — a 3.8 GB peak that eats into the ~8 GB budgeted for captured frames and ffmpeg's working set.
The 2 GB cap doesn't enforce what it claims to enforce (fits in 10 GB Lambda /tmp): it enforces post-cleanup size. The peak budget is 2 × limit in the worst case. Either:
- Document this explicitly: "cap is on post-cleanup planDir size; transient peak during plan() is up to 2×limit"
- Tighten the cap to 1 GB (so 2×limit = 2 GB peak fits comfortably) — but that's a real usability regression
- Use `os.statfs(/tmp)` BEFORE plan() to check available bytes, throw early if `available < estimated_peak`
I'd take "document" as the minimum cleanup since the alternatives are larger; flag for the v1.5 slicing work.
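The statfs pre-flight could be sketched with Node's `fs.statfsSync` (available since Node 18.15). The function name and the peak estimate here are hypothetical, not part of the PR:

```typescript
import { statfsSync } from "node:fs"; // Node >= 18.15

// Hypothetical pre-flight: refuse to start plan() when the filesystem
// backing tmpRoot can't hold the estimated transient peak
// (planDir + .plan-work, i.e. up to 2x the configured cap).
function assertTmpHasRoom(tmpRoot: string, estimatedPeakBytes: number): void {
  const stats = statfsSync(tmpRoot);
  // bavail = free blocks available to unprivileged users; bsize = block size.
  const availableBytes = stats.bavail * stats.bsize;
  if (availableBytes < estimatedPeakBytes) {
    throw new Error(
      `${tmpRoot} has ${availableBytes} bytes available; ` +
        `plan() needs an estimated ${estimatedPeakBytes}`,
    );
  }
}

// e.g. assertTmpHasRoom("/tmp", 2 * planDirSizeLimitBytes) before plan()
```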
nit — measurePlanDirBytes silently swallows readdir/stat errors (plan.ts:88-107)
```ts
try {
  entries = readdirSync(dir, { withFileTypes: true });
} catch {
  return;
}
```

The walker silently skips any directory it can't read. Failure mode: a permissions issue on a subtree underflows the size measurement, and a planDir that's actually 3 GB looks like it's 1 GB. The cap doesn't fire and the caller ships a planDir that breaks the chunk worker downstream. At minimum, log the error so it surfaces in adapter logs; better, throw on EACCES and only swallow ENOENT (file disappeared during the walk).
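A sketch of the suggested error policy — a reimplementation of the walker shape described here, not the PR's actual plan.ts:88-107 code:

```typescript
import { readdirSync, statSync, type Dirent } from "node:fs";
import { join } from "node:path";

// Swallow only ENOENT (entry vanished mid-walk); let EACCES and
// everything else propagate so an underflowed measurement can't
// silently pass the size cap.
function measurePlanDirBytes(dir: string): number {
  let entries: Dirent[];
  try {
    entries = readdirSync(dir, { withFileTypes: true });
  } catch (err) {
    if ((err as { code?: string }).code === "ENOENT") return 0;
    throw err;
  }
  let total = 0;
  for (const entry of entries) {
    if (entry.isSymbolicLink()) continue; // symlinks skipped per spec
    const path = join(dir, entry.name);
    if (entry.isDirectory()) {
      total += measurePlanDirBytes(path); // recurse into subdirs
    } else if (entry.isFile()) {
      try {
        total += statSync(path).size;
      } catch (err) {
        if ((err as { code?: string }).code !== "ENOENT") throw err;
      }
    }
  }
  return total;
}
```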
nit — the test "ignores symlinks" doesn't actually test symlink ignoring (planSizeCap.test.ts:204-213)
```ts
it("ignores symlinks (not traversed into)", () => {
  const dir = mkdtempSync(join(runRoot, "symlinks-"));
  writeFileSync(join(dir, "real.bin"), Buffer.alloc(128));
  // We don't actually create a symlink here because the planDir
  // materialization path strips them — but the function should still
  // gracefully ignore broken entries if any slipped in. Confirm the
  // baseline is correct (the real file's bytes).
  expect(measurePlanDirBytes(dir)).toBe(128);
});
```

The test name promises "ignores symlinks (not traversed into)" but the body acknowledges no symlink is created. This is the inverse of test value — the test passes because it's measuring a real file, not because the implementation correctly handles symlinks. Either:
- Create an actual symlink with `symlinkSync` and assert the size matches the non-symlinked baseline (proves they're skipped)
- Rename the test to honestly describe what it asserts ("baseline measurement on a non-symlinked tree")
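Option (1) might look like the sketch below. The walker here is a simplified flat stand-in so the example is self-contained; a real test would import `measurePlanDirBytes` from the module under test:

```typescript
import {
  mkdtempSync,
  readdirSync,
  statSync,
  symlinkSync,
  writeFileSync,
} from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Flat stand-in for measurePlanDirBytes (single-level dir is enough here).
function measureBytes(dir: string): number {
  let total = 0;
  for (const entry of readdirSync(dir, { withFileTypes: true })) {
    if (entry.isSymbolicLink()) continue; // the behavior under test
    if (entry.isFile()) total += statSync(join(dir, entry.name)).size;
  }
  return total;
}

// Reworked test body: create a real symlink. If the walker followed it,
// the measurement would double to 256; 128 proves symlinks are skipped.
const dir = mkdtempSync(join(tmpdir(), "symlinks-"));
writeFileSync(join(dir, "real.bin"), Buffer.alloc(128));
symlinkSync(join(dir, "real.bin"), join(dir, "link.bin"));
if (measureBytes(dir) !== 128) {
  throw new Error(`expected 128, got ${measureBytes(dir)}`);
}
```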
nit — PLAN_DIR_SIZE_LIMIT_BYTES = 2 * 1024 * 1024 * 1024 (plan.ts:42) — name doesn't match value
The constant says "BYTES" but the value is GiB-flavored (1024-based). Conventional naming would be PLAN_DIR_SIZE_LIMIT_BINARY_BYTES or PLAN_DIR_SIZE_LIMIT_GIB = 2. Minor — but adapter authors who do LAMBDA_TMP_GB * 1_000_000_000 will be off by 7% in either direction.
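The ~7% figure is easy to confirm (a standalone arithmetic check, not code from the PR):

```typescript
// The PR's 1024-based constant vs the 1000-based value an adapter
// author might write down from "2 GB":
const PLAN_DIR_SIZE_LIMIT_BYTES = 2 * 1024 * 1024 * 1024; // 2147483648 (2 GiB)
const twoDecimalGb = 2 * 1_000_000_000; // 2000000000 (2 GB)

// Relative drift between the two interpretations: ~7.4%.
const drift = PLAN_DIR_SIZE_LIMIT_BYTES / twoDecimalGb - 1;
console.log(drift.toFixed(4)); // 0.0737
```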
Verdict: APPROVE
Reasoning: The cap mechanics are correct, the typed error model matches the §9.3 contract from the design doc, and the tests cover the happy path + override. The disk-cleanup-on-throw is worth a follow-up commit but not a blocker — Lambda (the primary target) cleans up on container teardown, so the failure mode is only sharp on long-running worker deployments.
— Vai
Force-pushed 5f0239a to 819a06c.
Force-pushed 8074417 to 24f0ffe.
vanceingalls
left a comment
Re-review
PR rebased onto the new #809 base. The plan-size cap addition (`PLAN_DIR_SIZE_LIMIT_BYTES` = 2 GiB, `PlanTooLargeError`, `measurePlanDirBytes` walking the planDir post-freeze) is identical to what I previously approved at 5f0239a. The bytes formatter and the symlink-skipping walk match the spec.
Net-new findings
None.
Verdict: APPROVE
Reasoning: No content change since prior approval; cap and error class are well-scoped. CI required checks green.
— Vai
Force-pushed 24f0ffe to d0cd7f5.
Force-pushed dbbd329 to 6aca0b4.
Force-pushed 35239ea to b606106.
Force-pushed 6aca0b4 to 8590f07.
Force-pushed 8590f07 to e813fcb.
The base branch was changed.
Force-pushed e813fcb to fbbd41a.
miguel-heygen
left a comment
Re-stamping after rebase. LGTM.
Merge activity
…4) (#815)

## What

Phase 3 of the distributed rendering plan: §11 PR 3.5 format banlist. Extends `plan()` to refuse two v1-unsupported formats up front with a typed non-retryable `FormatNotSupportedInDistributedError` (`code === "FORMAT_NOT_SUPPORTED_IN_DISTRIBUTED"`).

## Why

Both webm and HDR mp4 are documented as deferred to v1.5 (§7.2 + §12), but until this PR the only signal at the runtime layer is the in-process pipeline silently producing wrong output (chunk concat-copy doesn't round-trip VP9; HDR signaling gets stripped at the chunk boundary). Failing fast at `plan()` time keeps adopters from spending fan-out compute on a render that can't succeed and gives them a typed error code their workflow adapter can route on.

## How

- New exports in `services/distributed/plan.ts`:
  - `FORMAT_NOT_SUPPORTED_IN_DISTRIBUTED` — non-retryable error code matching §11's wording.
  - `FormatNotSupportedInDistributedError` — typed error class with `code`, `format`, and `reason` fields. Message names the rejected format and tells adopters to fall back to the in-process renderer (`executeRenderJob`) which has full format support.
  - `rejectUnsupportedDistributedFormat(config)` — pure helper exported separately so adapters can run the same gate at their input layer (Step Functions input validation, Temporal workflow start) before the activity even runs.
- `plan()` calls `rejectUnsupportedDistributedFormat(config)` as the first line of the function — BEFORE `mkdirSync(planDir)` so a banned input never produces a partial planDir.
- Replaced the previous ad-hoc `if (hdrMode === "force-hdr") throw new Error(...)` with the typed error class.

### What did NOT change

`executeRenderJob`, the in-process orchestrator, the `hyperframes render` CLI, producer HTTP routes — all unchanged. The in-process renderer continues to accept webm + HDR (its existing functionality).

## Test plan

- [x] Unit tests added — `packages/producer/src/services/distributed/planFormatBanlist.test.ts`. 5 cases:
  - `rejectUnsupportedDistributedFormat` accepts the v1-supported formats (mp4, mov, png-sequence) with both `auto` and `force-sdr` hdrMode.
  - Rejects webm — error has `code === FORMAT_NOT_SUPPORTED_IN_DISTRIBUTED`, `format === "webm"`, message mentions in-process renderer.
  - Rejects HDR mp4 (`hdrMode === "force-hdr"`) — error has `format === "mp4-hdr"`, message mentions HDR.
  - End-to-end via `plan()`: webm throws with no planDir leaking to disk.
  - End-to-end via `plan()`: HDR mp4 throws with no planDir leaking to disk.
- [x] `bun test packages/producer/src/services/distributed/` — 30 pass.
- [x] `bun run --filter @hyperframes/producer typecheck` — clean.
- [x] `bunx oxlint` + `bunx oxfmt --check` — clean on changed files.
- [ ] Producer Docker regression harness — pending CI. `executeRenderJob` is unchanged; PSNR baselines should hold.

This is PR 5 of a 6-PR Phase 3 stack:

- 3.1 — `services/distributed/plan.ts` (#808)
- 3.2 — `services/distributed/renderChunk.ts` (#809)
- 3.3 — `services/distributed/assemble.ts` (#813)
- 3.4 — `planDir` size cap (`PLAN_TOO_LARGE`) (#814)
- **3.5 (this PR)** — distributed format banlist (webm + HDR mp4)
- 3.6 — public exports + `@hyperframes/producer/distributed` subpath

🤖 Generated with [Claude Code](https://claude.com/claude-code)

## What

Phase 3 of the distributed rendering plan: §6.4 / §9.3 size cap. Extends `plan()` to measure the produced planDir's total byte size after freeze and throw a typed non-retryable `PlanTooLargeError` (`code === "PLAN_TOO_LARGE"`) when the planDir exceeds 2 GB.

## Why

Distributed chunk workers ship the entire planDir to whatever ephemeral storage they're running on — `/tmp` on AWS Lambda (10 GB), the container filesystem on Cloud Run Jobs, etc. A planDir that doesn't fit can't be rendered. v1.5 lifts this cap via per-chunk video-frame slicing (§12); for now v1 fails fast at plan time so adapters don't waste a fan-out attempt that's guaranteed to OOM.

The 2 GB ceiling specifically targets Lambda's 10 GB `/tmp`: planDir + per-chunk captured frames + ffmpeg's working set all share that budget, and 2 GB leaves ~8 GB for capture/encode at 4K SDR.

## How

New in `services/distributed/plan.ts`:

- `PLAN_DIR_SIZE_LIMIT_BYTES` — the 2 GB constant.
- `PLAN_TOO_LARGE` — the non-retryable error code (matches §9.3).
- `PlanTooLargeError` — typed error class carrying `code`, `sizeBytes`, `limitBytes`, and a message that points adopters at the v1.5 slicing roadmap + the in-process renderer escape hatch.
- `measurePlanDirBytes(planDir)` — recursive on-disk size walker. Symlinks skipped intentionally.
- `DistributedRenderConfig.planDirSizeLimitBytes?: number` — optional override. Defaults to `PLAN_DIR_SIZE_LIMIT_BYTES`. Tests pass a tiny cap (1024 bytes) to exercise the throw path without filling 2 GB of /tmp.

The check runs in `plan()` AFTER the temp work tree is removed (so `.plan-work/` doesn't double-count) but BEFORE the function returns — adapters that catch the error never see a `PlanResult`.

### What did NOT change

`executeRenderJob`, the in-process orchestrator, the `hyperframes render` CLI, producer HTTP routes — all unchanged. Only `plan()` (which is itself opt-in) enforces the cap.

## Test plan

- `packages/producer/src/services/distributed/planSizeCap.test.ts`. 7 cases:
  - `measurePlanDirBytes` returns 0 for an empty dir, sums recursively, and gracefully ignores broken entries.
  - `PLAN_DIR_SIZE_LIMIT_BYTES` is `2 * 1024 * 1024 * 1024` (§6.4 pin).
  - `PlanTooLargeError` carries the `PLAN_TOO_LARGE` code + `sizeBytes` + `limitBytes` and mentions the v1.5 escape hatch.
  - `plan()` throws `PlanTooLargeError` when configured with a 1024-byte ceiling.
  - `plan()` succeeds when the default 2 GB ceiling is well above the produced planDir.
- `bun test packages/producer/src/services/distributed/` — 25 pass (PRs 3.1 + 3.2 + 3.3 + 3.4).
- `bun run --filter @hyperframes/producer typecheck` — clean.
- `bunx oxlint` + `bunx oxfmt --check` — clean on changed files.
- `executeRenderJob` is unchanged; PSNR baselines should hold.

This is PR 4 of a 6-PR Phase 3 stack:

- `services/distributed/plan.ts` (feat(producer): add services/distributed/plan.ts #808)
- `services/distributed/renderChunk.ts` (feat(producer): add services/distributed/renderChunk.ts #809)
- `services/distributed/assemble.ts` (feat(producer): add services/distributed/assemble.ts #813)
- `planDir` size cap (`PLAN_TOO_LARGE`) — this PR
- `@hyperframes/producer/distributed` subpath

🤖 Generated with Claude Code