
feat(producer): enforce planDir size cap (PLAN_TOO_LARGE)#814

Merged
jrusso1020 merged 1 commit into main from 05-13-feat_producer_enforce_plandir_size_cap_with_plan_too_large on May 14, 2026

Conversation

@jrusso1020 jrusso1020 commented May 13, 2026

What

Phase 3 of the distributed rendering plan: §6.4 / §9.3 size cap. Extends plan() to measure the produced planDir's total byte size after freeze and throw a typed non-retryable PlanTooLargeError (code === "PLAN_TOO_LARGE") when the planDir exceeds 2 GB.

Why

Distributed chunk workers ship the entire planDir to whatever ephemeral storage they're running on — /tmp on AWS Lambda (10 GB), the container filesystem on Cloud Run Jobs, etc. A planDir that doesn't fit can't be rendered. v1.5 lifts this cap via per-chunk video-frame slicing (§12); for now v1 fails fast at plan time so adapters don't waste a fan-out attempt that's guaranteed to OOM.

The 2 GB ceiling specifically targets Lambda's 10 GB /tmp: planDir + per-chunk captured frames + ffmpeg's working set all share that budget, and 2 GB leaves ~8 GB for capture/encode at 4K SDR.

How

  • New exports in services/distributed/plan.ts:
    • PLAN_DIR_SIZE_LIMIT_BYTES — the 2 GB constant.
    • PLAN_TOO_LARGE — the non-retryable error code (matches §9.3).
    • PlanTooLargeError — typed error class carrying code, sizeBytes, limitBytes, and a message that points adopters at the v1.5 slicing roadmap + the in-process renderer escape hatch.
    • measurePlanDirBytes(planDir) — recursive on-disk size walker. Symlinks skipped intentionally.
  • DistributedRenderConfig.planDirSizeLimitBytes?: number — optional override. Defaults to PLAN_DIR_SIZE_LIMIT_BYTES. Tests pass a tiny cap (1024 bytes) to exercise the throw path without filling 2 GB of /tmp.
  • The check runs in plan() AFTER the temp work tree is removed (so .plan-work/ doesn't double-count) but BEFORE the function returns — adapters that catch the error never see a PlanResult.
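
Concretely, a minimal sketch of the pieces above; the enforcePlanDirCap wrapper and the exact error message are illustrative (in plan.ts the check sits inline at the tail of plan()), and the walker's tolerance for broken entries is trimmed for brevity:

import { readdirSync, rmSync, statSync } from "node:fs";
import { join } from "node:path";

export const PLAN_DIR_SIZE_LIMIT_BYTES = 2 * 1024 * 1024 * 1024; // 2 GiB
export const PLAN_TOO_LARGE = "PLAN_TOO_LARGE" as const;

export class PlanTooLargeError extends Error {
  readonly code = PLAN_TOO_LARGE;
  constructor(
    readonly sizeBytes: number,
    readonly limitBytes: number,
  ) {
    super(
      `planDir is ${sizeBytes} bytes, over the ${limitBytes}-byte distributed cap; ` +
        `fall back to the in-process renderer or wait for v1.5 per-chunk slicing`,
    );
    this.name = "PlanTooLargeError";
  }
}

// Recursive on-disk size; symlinks skipped intentionally.
export function measurePlanDirBytes(dir: string): number {
  let total = 0;
  for (const entry of readdirSync(dir, { withFileTypes: true })) {
    if (entry.isSymbolicLink()) continue;
    const full = join(dir, entry.name);
    if (entry.isDirectory()) total += measurePlanDirBytes(full);
    else if (entry.isFile()) total += statSync(full).size;
  }
  return total;
}

// Placement: runs after the temp work tree is gone, before plan() returns.
export function enforcePlanDirCap(
  planDir: string,
  workDir: string,
  sizeLimitBytes: number = PLAN_DIR_SIZE_LIMIT_BYTES,
): void {
  rmSync(workDir, { recursive: true, force: true }); // .plan-work/ must not double-count
  const planDirBytes = measurePlanDirBytes(planDir);
  if (planDirBytes > sizeLimitBytes) {
    throw new PlanTooLargeError(planDirBytes, sizeLimitBytes);
  }
}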

What did NOT change

executeRenderJob, the in-process orchestrator, the hyperframes render CLI, producer HTTP routes — all unchanged. Only plan() (which is itself opt-in) enforces the cap.

Test plan

  • Unit tests added — packages/producer/src/services/distributed/planSizeCap.test.ts. 7 cases:
    • measurePlanDirBytes returns 0 for an empty dir, sums recursively, and gracefully ignores broken entries.
    • PLAN_DIR_SIZE_LIMIT_BYTES is 2 * 1024 * 1024 * 1024 (§6.4 pin).
    • PlanTooLargeError carries the PLAN_TOO_LARGE code + sizeBytes + limitBytes and mentions the v1.5 escape hatch.
    • plan() throws PlanTooLargeError when configured with a 1024-byte ceiling.
    • plan() succeeds when the default 2 GB ceiling is well above the produced planDir.
  • bun test packages/producer/src/services/distributed/ — 25 pass (PRs 3.1 + 3.2 + 3.3 + 3.4).
  • bun run --filter @hyperframes/producer typecheck — clean.
  • bunx oxlint + bunx oxfmt --check — clean on changed files.
  • Producer Docker regression harness — pending CI. executeRenderJob is unchanged; PSNR baselines should hold.
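
For illustration, a rough shape of the throw-path case listed above, assuming bun:test; the makeTestRenderConfig fixture and its module path are hypothetical stand-ins for however the suite actually builds a minimal render config:

import { expect, it } from "bun:test";
import { plan, PlanTooLargeError, PLAN_TOO_LARGE } from "./plan";
import { makeTestRenderConfig } from "./testFixtures"; // hypothetical fixture helper

it("plan() throws PLAN_TOO_LARGE when the cap is 1024 bytes", async () => {
  const config = makeTestRenderConfig({ planDirSizeLimitBytes: 1024 });

  let caught: unknown;
  try {
    await plan(config);
  } catch (err) {
    caught = err;
  }

  expect(caught).toBeInstanceOf(PlanTooLargeError);
  const planErr = caught as PlanTooLargeError;
  expect(planErr.code).toBe(PLAN_TOO_LARGE);
  expect(planErr.limitBytes).toBe(1024);
  expect(planErr.sizeBytes).toBeGreaterThan(1024);
});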

This is PR 4 of a 6-PR Phase 3 stack:

  • 3.1 — services/distributed/plan.ts (#808)
  • 3.2 — services/distributed/renderChunk.ts (#809)
  • 3.3 — services/distributed/assemble.ts (#813)
  • 3.4 (this PR) — planDir size cap (PLAN_TOO_LARGE)
  • 3.5 — distributed format banlist (webm + HDR mp4)
  • 3.6 — public exports + @hyperframes/producer/distributed subpath

🤖 Generated with Claude Code

@jrusso1020 jrusso1020 changed the title feat(producer): enforce planDir size cap with PLAN_TOO_LARGE feat(producer): enforce planDir size cap (PLAN_TOO_LARGE) May 13, 2026
@jrusso1020 jrusso1020 force-pushed the 05-13-feat_producer_add_services_distributed_assemble.ts branch from af0336e to 8074417 Compare May 13, 2026 23:02
@jrusso1020 jrusso1020 force-pushed the 05-13-feat_producer_enforce_plandir_size_cap_with_plan_too_large branch from 27a3d70 to 5f0239a Compare May 13, 2026 23:02
miguel-heygen previously approved these changes May 13, 2026

@miguel-heygen miguel-heygen left a comment


Review: planDir size cap (PLAN_TOO_LARGE)

Clean. The 2 GB cap is well-justified against Lambda's 10 GB /tmp budget. Check runs after workDir cleanup to avoid double-counting. Config override via planDirSizeLimitBytes is smart for both testing and tighter deployments. Error class carries structured fields for workflow adapters. Walker handles vanishing files and skips symlinks. Test coverage is thorough — 7 cases. No concerns. LGTM.


@vanceingalls vanceingalls left a comment


Defensive ceiling on planDir size. 2 GB cap is justified against Lambda's 10 GB /tmp budget, the constant is documented, the error class carries actionable fields, and the cap is configurable via DistributedRenderConfig.planDirSizeLimitBytes for tests + tighter-budget adapters. Small targeted PR done well.

Strengths

  • plan.ts:42-69 — PlanTooLargeError carries sizeBytes, limitBytes, AND a message that names the in-process renderer escape hatch + the v1.5 slicing roadmap. Adopters who hit it get actionable next steps without grepping source.
  • plan.ts:71-76 — formatBytes for human-readable units in the error message (sketched below). Trivial helper, but the difference between 2147483648 B and 2.00 GiB in a Lambda log is real.
  • planSizeCap.test.ts:170-213 — the test for measurePlanDirBytes directly (empty dir, recursive sum, symlink-ignored baseline) AND end-to-end via plan() is the right coverage shape.
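
The formatBytes helper named above is roughly this shape (a hypothetical reconstruction; plan.ts:71-76 itself isn't quoted in this review):

function formatBytes(bytes: number): string {
  const units = ["B", "KiB", "MiB", "GiB", "TiB"];
  let value = bytes;
  let unit = 0;
  while (value >= 1024 && unit < units.length - 1) {
    value /= 1024;
    unit += 1;
  }
  return unit === 0 ? `${value} B` : `${value.toFixed(2)} ${units[unit]}`;
}

// formatBytes(2147483648) === "2.00 GiB"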

Findings

important — failed planDirs are not cleaned up on PLAN_TOO_LARGE (plan.ts:629-636)

const planDirBytes = measurePlanDirBytes(planDir);
if (planDirBytes > sizeLimitBytes) {
  throw new PlanTooLargeError(planDirBytes, sizeLimitBytes);
}

The cap fires AFTER freezePlan finishes — so the planDir is fully materialized on disk before the throw. On Lambda this is fine (the container teardown reclaims /tmp). On Cloud Run Jobs / Temporal workers / Kubernetes Jobs that persist disk across activity invocations, you leave a 2+ GB planDir on every PLAN_TOO_LARGE — workers fill their disk after enough rejected plans and start failing healthy plans with ENOSPC.

Two fixes:

  1. Wrap the cap check in a try/catch that calls rmSync(planDir, { recursive: true, force: true }) on the throw path, as sketched below.
  2. Pre-flight: estimate planDir size BEFORE the expensive stages (video extract is the dominant cost). Reject early when the estimate exceeds the cap. Better diagnostic for the user (faster fail) AND no leaked planDir.

I'd take (1) as the immediate fix and queue (2) as a follow-up — (1) is mechanical, (2) requires estimating extract-stage output size before running it.
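
For (1), a minimal sketch of the shape, written against the fragment quoted above (a direct rmSync before the throw rather than a literal try/finally, which would also delete the planDir on success):

// Fix (1), sketched as a delta to the fragment above; rmSync comes from node:fs.
const planDirBytes = measurePlanDirBytes(planDir);
if (planDirBytes > sizeLimitBytes) {
  // Long-lived workers (Cloud Run Jobs, Temporal, k8s Jobs) keep their disk
  // across invocations, so don't leave 2+ GB behind on the throw path.
  rmSync(planDir, { recursive: true, force: true });
  throw new PlanTooLargeError(planDirBytes, sizeLimitBytes);
}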

important — the cap check runs AFTER the work tree cleanup, missing the worst-case peak (plan.ts:619-636)

The comment at line 631 says "The check runs AFTER cleanup so the workDir tree doesn't double-count." That's correct for the planDir-size cap, but the peak disk usage during plan() is planDir + .plan-work simultaneously — and the cap never sees that peak. Example: a 1.9 GB planDir plus a 1.9 GB transient .plan-work during the rename pass is a ~3.8 GB peak, nearly double the cap, from a planDir that passes the check.

The 2 GB cap doesn't enforce what it claims to enforce (fits in 10 GB Lambda /tmp): it enforces post-cleanup size. The peak budget is 2 * limit in the worst case. Either:

  • Document this explicitly: "cap is on post-cleanup planDir size; transient peak during plan() is up to 2×limit"
  • Tighten the cap to 1 GB (so 2×limit = 2 GB peak fits comfortably) — but that's a real usability regression
  • Use fs.statfsSync("/tmp") (node:fs) BEFORE plan() to check available bytes, throw early if available < estimated_peak (sketched below)

I'd take "document" as the minimum fix since the alternatives are larger; flag the rest for the v1.5 slicing work.
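
If the third option is ever picked up, a rough sketch using Node's fs.statfsSync (present since Node 18.15; whether the Bun runtime exposes it is an assumption here, and estimating the peak is the hard part):

import { statfsSync } from "node:fs";
import { tmpdir } from "node:os";

// Pre-flight guard: refuse to start plan() when the filesystem backing the
// temp dir can't hold the expected peak (planDir + transient .plan-work).
export function assertTmpHeadroom(estimatedPeakBytes: number): void {
  const stats = statfsSync(tmpdir());
  const availableBytes = stats.bavail * stats.bsize;
  if (availableBytes < estimatedPeakBytes) {
    throw new Error(
      `need ~${estimatedPeakBytes} bytes free in ${tmpdir()}, only ${availableBytes} available`,
    );
  }
}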

nit — measurePlanDirBytes silently swallows readdir/stat errors (plan.ts:88-107)

try {
  entries = readdirSync(dir, { withFileTypes: true });
} catch {
  return;
}

The walker silently skips any directory it can't read. Failure mode: a permissions issue on a subtree undercounts the size measurement, and a planDir that's actually 3 GB looks like it's 1 GB. The cap doesn't fire and the caller ships a planDir that breaks the chunk worker downstream. At minimum, log the error so it surfaces in adapter logs; better, throw on EACCES and only swallow ENOENT (file disappeared during the walk).
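
One possible shape for that tightening, keeping ENOENT tolerance for entries that vanish mid-walk but surfacing everything else (a sketch, not the current implementation):

import { readdirSync, lstatSync, type Dirent } from "node:fs";
import { join } from "node:path";

export function measurePlanDirBytes(planDir: string): number {
  let total = 0;
  const walk = (dir: string): void => {
    let entries: Dirent[] = [];
    try {
      entries = readdirSync(dir, { withFileTypes: true });
    } catch (err) {
      // Tolerate directories that vanish mid-walk; surface EACCES etc.
      // instead of silently undercounting the cap.
      if ((err as { code?: string }).code === "ENOENT") return;
      throw err;
    }
    for (const entry of entries) {
      if (entry.isSymbolicLink()) continue; // skipped per spec
      const full = join(dir, entry.name);
      if (entry.isDirectory()) {
        walk(full);
      } else if (entry.isFile()) {
        try {
          total += lstatSync(full).size;
        } catch (err) {
          if ((err as { code?: string }).code === "ENOENT") continue;
          throw err;
        }
      }
    }
  };
  walk(planDir);
  return total;
}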

nit — the test "ignores symlinks" doesn't actually test symlink ignoring (planSizeCap.test.ts:204-213)

it("ignores symlinks (not traversed into)", () => {
  const dir = mkdtempSync(join(runRoot, "symlinks-"));
  writeFileSync(join(dir, "real.bin"), Buffer.alloc(128));
  // We don't actually create a symlink here because the planDir
  // materialization path strips them — but the function should still
  // gracefully ignore broken entries if any slipped in. Confirm the
  // baseline is correct (the real file's bytes).
  expect(measurePlanDirBytes(dir)).toBe(128);
});

The test name promises "ignores symlinks (not traversed into)" but the body acknowledges no symlink is created. That gets the test's value backwards — it passes because it measures a real file, not because the implementation correctly handles symlinks. Either:

  • Create an actual symlink with symlinkSync and assert the size matches the non-symlinked baseline (proves they're skipped)
  • Rename the test to honestly describe what it asserts ("baseline measurement on a non-symlinked tree")
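
The first option might look like this (bun:test assumed; the temp-dir handling is illustrative, the real suite uses its own runRoot):

import { expect, it } from "bun:test";
import { mkdtempSync, symlinkSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";
import { measurePlanDirBytes } from "./plan";

it("ignores symlinks (not traversed into)", () => {
  const dir = mkdtempSync(join(tmpdir(), "symlinks-"));
  writeFileSync(join(dir, "real.bin"), Buffer.alloc(128));
  // Point a symlink at the real file; if the walker followed it, the
  // measurement would double to 256.
  symlinkSync(join(dir, "real.bin"), join(dir, "link.bin"));
  expect(measurePlanDirBytes(dir)).toBe(128);
});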

nit — PLAN_DIR_SIZE_LIMIT_BYTES = 2 * 1024 * 1024 * 1024 (plan.ts:42) — name doesn't match value

The constant says "BYTES" but the value is GiB-flavored (1024-based). Conventional naming would be PLAN_DIR_SIZE_LIMIT_BINARY_BYTES or PLAN_DIR_SIZE_LIMIT_GIB = 2. Minor — but adapter authors who do LAMBDA_TMP_GB * 1_000_000_000 will be off by 7% in either direction.

Verdict: APPROVE
Reasoning: The cap mechanics are correct, the typed error model matches the §9.3 contract from the design doc, and the tests cover the happy path + override. The disk-cleanup-on-throw is worth a follow-up commit but not a blocker — Lambda (the primary target) cleans up on container teardown, so the failure mode is only sharp on long-running worker deployments.

— Vai

@jrusso1020 jrusso1020 force-pushed the 05-13-feat_producer_enforce_plandir_size_cap_with_plan_too_large branch from 5f0239a to 819a06c Compare May 14, 2026 00:14
@jrusso1020 jrusso1020 force-pushed the 05-13-feat_producer_add_services_distributed_assemble.ts branch from 8074417 to 24f0ffe Compare May 14, 2026 00:14
vanceingalls previously approved these changes May 14, 2026

@vanceingalls vanceingalls left a comment


Re-review

PR rebased onto the new #809 base. The plan-size cap addition (PLAN_DIR_SIZE_LIMIT_BYTES = 2 GiB, PlanTooLargeError, measurePlanDirBytes walking the planDir post-freeze) is identical to what I previously approved at 5f0239a. The bytes formatter is still in place and the walk still skips symlinks per spec.

Net-new findings

None.

Verdict: APPROVE
Reasoning: No content change since prior approval; cap and error class are well-scoped. CI required checks green.

— Vai

@jrusso1020 jrusso1020 force-pushed the 05-13-feat_producer_add_services_distributed_assemble.ts branch from 24f0ffe to d0cd7f5 Compare May 14, 2026 00:47
@jrusso1020 jrusso1020 force-pushed the 05-13-feat_producer_enforce_plandir_size_cap_with_plan_too_large branch 2 times, most recently from dbbd329 to 6aca0b4 Compare May 14, 2026 01:13
@jrusso1020 jrusso1020 force-pushed the 05-13-feat_producer_add_services_distributed_assemble.ts branch 3 times, most recently from 35239ea to b606106 Compare May 14, 2026 02:10
@jrusso1020 jrusso1020 force-pushed the 05-13-feat_producer_enforce_plandir_size_cap_with_plan_too_large branch from 6aca0b4 to 8590f07 Compare May 14, 2026 02:10
@jrusso1020 jrusso1020 changed the base branch from 05-13-feat_producer_add_services_distributed_assemble.ts to graphite-base/814 May 14, 2026 02:38
@jrusso1020 jrusso1020 force-pushed the 05-13-feat_producer_enforce_plandir_size_cap_with_plan_too_large branch from 8590f07 to e813fcb Compare May 14, 2026 02:39
@graphite-app graphite-app Bot changed the base branch from graphite-base/814 to main May 14, 2026 02:39
@graphite-app graphite-app Bot dismissed stale reviews from vanceingalls and miguel-heygen May 14, 2026 02:39

The base branch was changed.

@jrusso1020 jrusso1020 force-pushed the 05-13-feat_producer_enforce_plandir_size_cap_with_plan_too_large branch from e813fcb to fbbd41a Compare May 14, 2026 02:39

@miguel-heygen miguel-heygen left a comment


Re-stamping after rebase. LGTM.

@jrusso1020 jrusso1020 merged commit 3eb7ad2 into main May 14, 2026
39 checks passed

Merge activity

@jrusso1020 jrusso1020 deleted the 05-13-feat_producer_enforce_plandir_size_cap_with_plan_too_large branch May 14, 2026 03:08
jrusso1020 added a commit that referenced this pull request May 14, 2026
…4) (#815)

## What

Phase 3 of the distributed rendering plan: §11 PR 3.5 format banlist. Extends `plan()` to refuse two v1-unsupported formats up front with a typed non-retryable `FormatNotSupportedInDistributedError` (`code === "FORMAT_NOT_SUPPORTED_IN_DISTRIBUTED"`).

## Why

Both webm and HDR mp4 are documented as deferred to v1.5 (§7.2 + §12), but until this PR the only signal at the runtime layer is the in-process pipeline silently producing wrong output (chunk concat-copy doesn't round-trip VP9; HDR signaling gets stripped at the chunk boundary). Failing fast at `plan()` time keeps adopters from spending fan-out compute on a render that can't succeed and gives them a typed error code their workflow adapter can route on.

## How

- New exports in `services/distributed/plan.ts`:
  - `FORMAT_NOT_SUPPORTED_IN_DISTRIBUTED` — non-retryable error code matching §11's wording.
  - `FormatNotSupportedInDistributedError` — typed error class with `code`, `format`, and `reason` fields. Message names the rejected format and tells adopters to fall back to the in-process renderer (`executeRenderJob`) which has full format support.
  - `rejectUnsupportedDistributedFormat(config)` — pure helper exported separately so adapters can run the same gate at their input layer (Step Functions input validation, Temporal workflow start) before the activity even runs.
- `plan()` calls `rejectUnsupportedDistributedFormat(config)` as the first line of the function — BEFORE `mkdirSync(planDir)` so a banned input never produces a partial planDir.
- Replaced the previous ad-hoc `if (hdrMode === "force-hdr") throw new Error(...)` with the typed error class.
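
A rough sketch of the gate's shape (the class and helper names follow the description above; the config field names and reason strings are assumptions, not the real `DistributedRenderConfig` surface):

```ts
export const FORMAT_NOT_SUPPORTED_IN_DISTRIBUTED =
  "FORMAT_NOT_SUPPORTED_IN_DISTRIBUTED" as const;

export class FormatNotSupportedInDistributedError extends Error {
  readonly code = FORMAT_NOT_SUPPORTED_IN_DISTRIBUTED;
  constructor(
    readonly format: string,
    readonly reason: string,
  ) {
    super(
      `${format} is not supported by the distributed renderer (${reason}); ` +
        `fall back to the in-process renderer (executeRenderJob)`,
    );
    this.name = "FormatNotSupportedInDistributedError";
  }
}

// Pure helper so adapters can run the same gate at their input layer.
export function rejectUnsupportedDistributedFormat(config: {
  format: string;
  hdrMode?: "auto" | "force-sdr" | "force-hdr";
}): void {
  if (config.format === "webm") {
    throw new FormatNotSupportedInDistributedError(
      "webm",
      "chunk concat-copy does not round-trip VP9; deferred to v1.5",
    );
  }
  if (config.hdrMode === "force-hdr") {
    throw new FormatNotSupportedInDistributedError(
      "mp4-hdr",
      "HDR signaling is stripped at the chunk boundary; deferred to v1.5",
    );
  }
}
```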

### What did NOT change

`executeRenderJob`, the in-process orchestrator, the `hyperframes render` CLI, producer HTTP routes — all unchanged. The in-process renderer continues to accept webm + HDR (its existing functionality).

## Test plan

- [x] Unit tests added — `packages/producer/src/services/distributed/planFormatBanlist.test.ts`. 5 cases:
  - `rejectUnsupportedDistributedFormat` accepts the v1-supported formats (mp4, mov, png-sequence) with both `auto` and `force-sdr` hdrMode.
  - Rejects webm — error has `code === FORMAT_NOT_SUPPORTED_IN_DISTRIBUTED`, `format === "webm"`, message mentions in-process renderer.
  - Rejects HDR mp4 (`hdrMode === "force-hdr"`) — error has `format === "mp4-hdr"`, message mentions HDR.
  - End-to-end via `plan()`: webm throws with no planDir leaking to disk.
  - End-to-end via `plan()`: HDR mp4 throws with no planDir leaking to disk.
- [x] `bun test packages/producer/src/services/distributed/` — 30 pass.
- [x] `bun run --filter @hyperframes/producer typecheck` — clean.
- [x] `bunx oxlint` + `bunx oxfmt --check` — clean on changed files.
- [ ] Producer Docker regression harness — pending CI. `executeRenderJob` is unchanged; PSNR baselines should hold.

This is PR 5 of a 6-PR Phase 3 stack:

- 3.1 — `services/distributed/plan.ts` (#808)
- 3.2 — `services/distributed/renderChunk.ts` (#809)
- 3.3 — `services/distributed/assemble.ts` (#813)
- 3.4 — `planDir` size cap (`PLAN_TOO_LARGE`) (#814)
- **3.5 (this PR)** — distributed format banlist (webm + HDR mp4)
- 3.6 — public exports + `@hyperframes/producer/distributed` subpath

🤖 Generated with [Claude Code](https://claude.com/claude-code)
