
feat(producer): add services/distributed/renderChunk.ts #809

Merged — jrusso1020 merged 1 commit into main from 05-13-feat_producer_add_services_distributed_renderchunk.ts on May 14, 2026

Conversation

@jrusso1020 (Collaborator) commented May 13, 2026

What

Phase 3 of the distributed rendering plan: the second public primitive. Adds renderChunk(planDir, chunkIndex, outputChunkPath) and its supporting types as packages/producer/src/services/distributed/renderChunk.ts. See DISTRIBUTED-RENDERING-PLAN.md §11 Phase 3.

Why

plan() (PR #808) produces the planDir. renderChunk() is what every chunk worker actually runs — as a Temporal activity, a Step Functions Map task on Lambda, or a Cloud Run Job invocation. Phase 1 extracted the capture/encode stages; Phase 2 added the determinism-hardening flags those stages need. This PR is the first caller that flips them to true for the capture + encode path (PR #808 was the first caller for compileForRender's flag).

How

renderChunk composes the pipeline:

  1. Read + validate plan.json, meta/{composition,encoder,chunks}.json. Out-of-range chunkIndex, missing artifacts, or browserGpuMode !== "software" trip a typed RenderChunkValidationError with code === PLAN_HASH_MISMATCH / BROWSER_GPU_NOT_SOFTWARE.
  2. readFfmpegVersion() matches against plan.ffmpegVersion — any drift trips a non-retryable FFMPEG_VERSION_MISMATCH per §9.3.
  3. applyRuntimeEnvSnapshot(encoder.runtimeEnv) BEFORE the file server is created — RENDER_MODE_SCRIPT bakes those env vars into served HTML at module load.
  4. File server points at <planDir>/compiled/ with buildVirtualTimeShim({ seedRandomFromFrame: true }) (PR 2.4 surface).
  5. createCaptureSession with lockWarmupTicks: true (PR 2.3 surface).
  6. assertSwiftShader(session.page, readWebGlVendorInfoFromCanvas) BEFORE initializeSession. The default assertSwiftShader reader navigates to chrome://gpu, which chrome-headless-shell serves as an empty document on multiple builds we've tested; this PR threads a canvas + WEBGL_debug_renderer_info reader through the helper's existing readInfo override so the assertion works on both regular Chrome and chrome-headless-shell. No fork of the Phase 2 helper.
  7. discardWarmupCapture(session, startFrame, startTime) once before the chunk's first real frame (PR 2.6 surface).
  8. runCaptureStage with frameRange: { startFrame, endFrame } and workerCount: 1. The new optional frameRange field on CaptureStageInput extends the sequential capture branch: per-frame TIMES use absolute composition frame indices (so virtual time matches an in-process render at that frame), file NAMES normalize to zero (so the encoder reads them without an -start_number override). In-process callers omit frameRange and get the existing [0, totalFrames) behavior verbatim. (See the mapping sketch after this list.)
  9. runEncodeStage with lockGopForChunkConcat: true + gopSize: framesInChunk + hasAudio: false. Closed-GOP per §7.1 so concat-copy at assemble time round-trips losslessly. The two new optional fields are pass-through to EncoderOptions; in-process callers omit them.
  10. Hash output via SHA-256 (file for mp4/mov; sorted-frame-fingerprint for png-sequence), write a perf sidecar, return ChunkResult.
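A minimal sketch of the step-8 mapping, assuming a rational fps of shape { num, den }; frameRange matches the PR's field name, everything else is illustrative rather than the real captureStage code:

```ts
// Illustrative only, not the captureStage implementation.
interface FrameRange {
  startFrame: number; // absolute composition frame index, inclusive
  endFrame: number;   // absolute composition frame index, exclusive
}

function chunkFrames(range: FrameRange, fps: { num: number; den: number }) {
  const frames: Array<{ timeSeconds: number; fileName: string }> = [];
  for (let abs = range.startFrame; abs < range.endFrame; abs++) {
    frames.push({
      // TIME uses the absolute index, so virtual time matches an
      // in-process render at the same composition frame.
      timeSeconds: (abs * fps.den) / fps.num,
      // NAME normalizes to zero, so the encoder reads the sequence
      // without an ffmpeg -start_number override.
      fileName: `frame_${String(abs - range.startFrame).padStart(6, "0")}.png`,
    });
  }
  return frames;
}
```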

Changes outside the new module

| File | Change | Backwards compat |
| --- | --- | --- |
| render/stages/captureStage.ts | Optional frameRange?: { startFrame, endFrame } on CaptureStageInput. Sequential branch only; rejects workerCount > 1 && frameRange. | frameRange === undefined → identical behavior to [0, totalFrames). |
| render/stages/encodeStage.ts | Optional lockGopForChunkConcat? + gopSize? on EncodeStageInput, pass-through to EncoderOptions (flags sketched below). | Defaults are pass-through; in-process call site omits both. |
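For the encodeStage row above, the following are the kind of libx264 arguments a closed-GOP-per-chunk encode typically needs. The real lockGopForChunkConcat → EncoderOptions → ffmpeg mapping is internal to the producer, so treat these exact flags as an assumption:

```ts
// Typical ffmpeg/libx264 flags for a closed GOP sized to the whole chunk,
// so every chunk starts on an IDR frame and concat-copy round-trips.
function closedGopArgs(framesInChunk: number): string[] {
  return [
    "-g", String(framesInChunk),          // GOP length = chunk length
    "-keyint_min", String(framesInChunk), // no earlier keyframes
    "-sc_threshold", "0",                 // disable scene-cut keyframe insertion
    "-flags", "+cgop",                    // closed GOP: no references across GOP boundaries
  ];
}
```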

renderOrchestrator.ts, the hyperframes render CLI, and the producer HTTP routes are untouched.

Test plan

  • Unit tests added — packages/producer/src/services/distributed/renderChunk.test.ts. 3 cases:
    • Byte-identical retry contract (§5.1) — renders chunk 0 twice on the same planDir and asserts the ChunkResult.sha256 matches.
    • OOB chunkIndex rejected before Chrome init.
    • Missing plan.json rejected before Chrome init.
  • bun test packages/producer/src/services/distributed/ — 13 pass (PR 3.1 + 3.2 combined).
  • bun test packages/producer/src/ — 315 pass, 1 fail. The one failure is the pre-existing `writeCompiledArtifacts — external assets on Windows drive-letter paths` flake, which also fails on a clean checkout of origin/main.
  • bun run --filter @hyperframes/producer typecheck — clean.
  • bunx oxlint + bunx oxfmt --check — clean on changed files.
  • Producer Docker regression harness — pending CI. executeRenderJob is unchanged here; PSNR baselines should hold.

Known limitation surfaced during testing

chrome-headless-shell on some dev/CI hosts can't initialize SwiftShader (the GL stack errors out before BeginFrame can run). The byte-identical retry test soft-skips on those hosts, and the Docker harness (where the chrome-headless-shell build is matched to the planDir's ffmpegVersion) is the source of truth for the determinism contract. To exercise the test locally on a host with a working chrome-headless-shell, the test fixture uses format: "png-sequence", which forces screenshot mode and avoids the BeginFrame dependency.

This is PR 2 of a 6-PR Phase 3 stack:

  • 3.1 — services/distributed/plan.ts (feat(producer): add services/distributed/plan.ts #808)
  • 3.2 (this PR) — services/distributed/renderChunk.ts
  • 3.3 — services/distributed/assemble.ts
  • 3.4 — planDir size cap (PLAN_TOO_LARGE)
  • 3.5 — distributed format banlist (webm + HDR mp4)
  • 3.6 — public exports + @hyperframes/producer/distributed subpath

🤖 Generated with Claude Code

@miguel-heygen (Collaborator) left a comment

Review: renderChunk.ts — Distributed Chunk Rendering

The #1 concern — frame offset handling — is correct. The critical path through discardWarmupCapture → runCaptureStage uses absolute frame indices for time calculation and relative indices for file naming. This is the proper fix for the original frameStartOffset bug.

Resource cleanup is thorough (try/finally, session null-guard, work dir preserved on failure for debugging). Typed validation errors with string codes enable workflow adapters to mark failures as non-retryable. Validations run before Chrome init — correct fail-fast ordering.

Minor nits

  1. inferQualityFromEncoder incomplete (shared.ts): Maps prores-software and png-sequence to "high", everything else to "standard". If new encoder presets are added (e.g., AV1), this silently defaults. Consider an exhaustive check or log warning for unrecognized encoders.

  2. outputResolution omitted in buildSyntheticRenderJob (~line 407): Defaults to undefined. If the plan was created with an explicit output resolution, chunks render at source dimensions. Verify that plan.ts bakes resolution into width/height in plan.json.

  3. audioOutputPath made optional in encodeStage.ts: The png path guards correctly, but the non-png path still receives audioOutputPath downstream. Low risk since renderChunk always passes hasAudio: false, but a defensive check would prevent future caller issues.

None of these are blockers. LGTM.

@vanceingalls (Collaborator) left a comment

Phase 3 Activity B — renderChunk composes a single chunk's capture+encode against the planDir contract. The byte-identical-retry framing is right, the SwiftShader-via-canvas workaround for chrome-headless-shell is a clean fix, and the new frameRange field on captureStage threads cleanly without breaking the in-process branch. A few real gaps below.

Strengths

  • renderChunk.ts:538-572 — readWebGlVendorInfoFromCanvas is the right answer to chrome-headless-shell's empty chrome://gpu document. The fallback to gl.VENDOR/gl.RENDERER when the extension is missing keeps it robust on stripped-down Chrome builds. (A minimal in-page sketch follows this list.)
  • captureStage.ts:1148-1166 — the absolute-frame-index time + zero-normalized file name split is the contract that lets in-process renders and chunk renders converge on byte-identical pixels at a given absolute frame. Cited at captureStage.ts:1162 (time = (absoluteIdx * fps.den) / fps.num).
  • shared.ts:925-942 — caching ffmpeg -version after first read saves 20-50ms per chunk in long-running workers (Cloud Run Job, Temporal activity worker). Small but exactly the right kind of caching: an immutable fact, hashed into planHash so cache invalidation is moot.
  • renderChunk.test.ts:281-345 — the soft-skip pattern for chrome-headless-shell-broken hosts (chrome:\/\/gpu|BROWSER_GPU_NOT_SOFTWARE|SwiftShader|... regex) keeps local dev from grinding on a host-Chrome bug while still asserting the contract inside the Docker harness. Pragmatic.
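For reference, a minimal in-page sketch of the canvas-based vendor read (the kind of function a page.evaluate callback could run); the shipped readWebGlVendorInfoFromCanvas is not reproduced here:

```ts
// Runs in the page context. Returns null when WebGL itself is unavailable.
function readWebGlVendorInfo(): { vendor: string; renderer: string } | null {
  const gl = document.createElement("canvas").getContext("webgl");
  if (!gl) return null;
  const ext = gl.getExtension("WEBGL_debug_renderer_info");
  if (ext) {
    // Unmasked strings name the real GL stack, e.g. "Google SwiftShader".
    return {
      vendor: String(gl.getParameter(ext.UNMASKED_VENDOR_WEBGL)),
      renderer: String(gl.getParameter(ext.UNMASKED_RENDERER_WEBGL)),
    };
  }
  // Fallback for builds that strip the extension (see the strengths note above).
  return {
    vendor: String(gl.getParameter(gl.VENDOR)),
    renderer: String(gl.getParameter(gl.RENDERER)),
  };
}
```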

Findings

blocker — renderChunk never actually validates planHash against the planDir bytes

plan.json.planHash is read into plan.planHash (renderChunk.ts:626) and copied verbatim into the perf sidecar at line 863, but renderChunk NEVER recomputes the hash from the on-disk contents to verify the worker is reading the same bytes the controller wrote. The PLAN_HASH_MISMATCH code is reused for two semantically different errors: "missing required file" (lines 620, 632) and "missing slice" (line 640). The actual contract — "this planDir's content-addressed identity matches the value the controller computed" — is never checked.

Failure mode: a corrupted planDir (truncated meta/encoder.json, partial S3 download, malicious tampering of compiled/index.html) renders silently with the wrong bytes. Every chunk reports sha256 based on its own output — but those outputs are derived from a planDir the worker never validated. The whole point of planHash (plan.ts:771 computes it, freezePlan.ts:266 returns it) is to catch exactly this.

Two options to fix:

  1. Recompute planHash on the worker at boot (cost: a re-walk of <planDir>/ and a fresh sha256 of every asset). Throws PLAN_HASH_MISMATCH when the recomputed hash != plan.planHash. This is what the design doc §9.3 implies.
  2. If full re-hash is too expensive per chunk in a long-running worker, recompute once and memoize against the planDir path so subsequent chunks on the same planDir skip the work.

Either way, the gap matters because the whole non-retryable error model leans on planHash being load-bearing. As-is, the code says it's load-bearing but doesn't actually load-bear.
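A sketch of option (2) under stated assumptions: RenderChunkValidationError's constructor shape and the hash framing (sorted relative paths plus file bytes) are hypothetical stand-ins for whatever plan.ts actually hashes.

```ts
import { createHash } from "node:crypto";
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Hypothetical stand-in for the module's typed error.
class RenderChunkValidationError extends Error {
  constructor(readonly code: string, message: string) {
    super(message);
  }
}

// Deterministic re-walk: sorted relative paths, each hashed with its bytes.
function recomputePlanDirHash(planDir: string): string {
  const h = createHash("sha256");
  const walk = (dir: string, rel = ""): void => {
    const entries = readdirSync(dir, { withFileTypes: true }).sort((a, b) =>
      a.name.localeCompare(b.name),
    );
    for (const e of entries) {
      const r = rel ? `${rel}/${e.name}` : e.name;
      if (e.isDirectory()) walk(join(dir, e.name), r);
      else if (e.isFile()) {
        h.update(r);
        h.update(readFileSync(join(dir, e.name)));
      }
    }
  };
  walk(planDir);
  return h.digest("hex");
}

const planHashCache = new Map<string, string>();

// Memoized per planDir path: later chunks in the same worker skip the re-walk.
function verifyPlanHash(planDir: string, expected: string): void {
  let actual = planHashCache.get(planDir);
  if (actual === undefined) {
    actual = recomputePlanDirHash(planDir);
    planHashCache.set(planDir, actual);
  }
  if (actual !== expected) {
    throw new RenderChunkValidationError(
      "PLAN_HASH_MISMATCH",
      `planDir hash ${actual} does not match plan.json.planHash ${expected}`,
    );
  }
}
```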

important — applyRuntimeEnvSnapshot mutates process.env globally with no restore (runtimeEnvSnapshot.ts:1068-1078)

applyRuntimeEnvSnapshot(encoder.runtimeEnv ?? {}) overwrites process.env keys and never restores them. For a worker that runs ONE chunk per process (Lambda) this is fine. For a worker that runs MANY chunks per process (Cloud Run Job, Temporal activity worker — both called out as target deployments in the PR body), this pollutes env across chunks: chunk N's env sticks around when chunk N+1 boots, and if chunk N+1's planDir has a different (or absent) RENDER_MODE_SCRIPT key, the leaked value wins.

The RUNTIME_ENV_PREFIXES filter narrows the blast radius but doesn't eliminate the problem — keys within the prefix range can still cross-contaminate. Two fixes:

  1. Snapshot the pre-apply values for keys matching RUNTIME_ENV_PREFIXES, restore them in renderChunk's finally block. Same lifecycle as the captureSession close — chunk-scoped.
  2. Document the hard contract: "one chunk per process; multi-chunk processes must unset between chunks." Then enforce it via a processChunkCounter that throws on second invocation.

I'd take (1) — it's the contract callers actually expect from a "pure function over local paths" (which the module docstring at line 408 explicitly claims). Mutating global process state silently is the opposite of pure.
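A sketch of fix (1); the prefix list is a placeholder, and the real RUNTIME_ENV_PREFIXES values and snapshot shape live in runtimeEnvSnapshot.ts:

```ts
const RUNTIME_ENV_PREFIXES = ["RENDER_MODE_", "HYPERFRAMES_"]; // placeholder values

function applyRuntimeEnvSnapshot(
  snapshot: Record<string, string>,
): { restore: () => void } {
  const previous = new Map<string, string | undefined>();
  for (const [key, value] of Object.entries(snapshot)) {
    if (!RUNTIME_ENV_PREFIXES.some((p) => key.startsWith(p))) continue;
    previous.set(key, process.env[key]); // remember pre-apply value (or absence)
    process.env[key] = value;
  }
  return {
    restore() {
      for (const [key, old] of previous) {
        if (old === undefined) delete process.env[key];
        else process.env[key] = old;
      }
      previous.clear(); // second restore() is a no-op
    },
  };
}
```

renderChunk would capture the returned handle and call restore() in its finally block, giving the env the same chunk-scoped lifecycle as the capture session.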

important — renderChunk.ts:705 work directory naming collides on retry of the same (planDir, chunkIndex)

renderChunk.ts:705:

const workDir = `${outputChunkPath}.work`;

and renderChunk.ts:706-707:

if (existsSync(workDir)) rmSync(workDir, { recursive: true, force: true });
mkdirSync(workDir, { recursive: true });

The blanket pre-delete is fine for sequential retry — but for two concurrent invocations on the SAME (planDir, chunkIndex) (e.g., a Temporal scheduler that double-fires due to heartbeat skew), they'd race: A creates work, B rms A's work mid-render, A errors on missing files. The byte-identical-retry contract assumes serial retries on the same path; double-fire breaks it.

Either rename to include a unique suffix (`${outputChunkPath}.work.${process.pid}.${randomBytes(4).toString("hex")}`) or document "callers must serialize retries on the same outputChunkPath." Same applies to outputChunkPath itself — if A and B both write to it concurrently, the resulting bytes are undefined.
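A sketch of the unique-suffix option (the expression mirrors the one proposed above):

```ts
import { randomBytes } from "node:crypto";
import { mkdirSync } from "node:fs";

// pid + random hex keeps two double-fired invocations on the same
// (planDir, chunkIndex) from deleting each other's in-progress files.
function makeWorkDir(outputChunkPath: string): string {
  const workDir = `${outputChunkPath}.work.${process.pid}.${randomBytes(4).toString("hex")}`;
  mkdirSync(workDir, { recursive: true });
  return workDir;
}
```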

important — RenderChunkValidationError overloads PLAN_HASH_MISMATCH for non-hash errors (renderChunk.ts:620, 632, 640, 697)

The code raises PLAN_HASH_MISMATCH for four distinct conditions:

  • missing required artifact (line 620)
  • chunkIndex out of range (line 632)
  • chunk slice undefined (line 640)
  • compiled/ directory missing (line 697)

None of these are actually plan-hash mismatches — they're shape/structural errors with the planDir. Adapters routing retry policies off error.code === "PLAN_HASH_MISMATCH" will lump cross-version drift (the real PLAN_HASH_MISMATCH case) together with "user passed a chunkIndex bigger than the chunk count" (a caller bug). Distinct error codes:

  • MISSING_PLAN_ARTIFACT for the file-existence checks
  • CHUNK_INDEX_OUT_OF_RANGE for the bounds check
  • PLAN_HASH_MISMATCH reserved for the actual content-hash drift case (which today doesn't fire at all — see the blocker above)

nit — encoder.runtimeEnv ?? {} swallows the case where the controller forgot to snapshot (renderChunk.ts:671)

A planDir produced before snapshotRuntimeEnv landed (or by a buggy controller) has no runtimeEnv key — the chunk worker silently applies an empty snapshot and proceeds, while the controller's render captured a different set of env vars. The chunk renders with default env → pixels diverge → the planHash mismatch is detected (if you fix the blocker above) but the diagnostic message is confusing. Reject explicitly: if (!encoder.runtimeEnv) throw ... ("planDir missing runtimeEnv snapshot — re-plan with current producer").

nit — inferQualityFromEncoder (shared.ts:1040-1044) is a fragile inverse

if (encoder === "prores-software" || encoder === "png-sequence") return "high";
return "standard";

If a future encoder option (libx264-software-fast, prores-422-hq) lands and the table in plan.ts:574-577 adds a new row, this function silently routes it to "standard". Either persist the original quality enum in LockedRenderConfig (which is content-addressed anyway — adding a field would bump planHash) or make the inverse exhaustive with a typed default-handling branch.
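A sketch of the exhaustive-inverse option; the encoder names beyond the two quoted above are made up for illustration:

```ts
type EncoderPreset = "prores-software" | "png-sequence" | "libx264-software";

function inferQualityFromEncoder(encoder: EncoderPreset): "high" | "standard" {
  switch (encoder) {
    case "prores-software":
    case "png-sequence":
      return "high";
    case "libx264-software":
      return "standard";
    default: {
      // Compile-time error here when a new EncoderPreset member is added
      // without a mapping; the silent "standard" default disappears.
      const unhandled: never = encoder;
      throw new Error(`unrecognized encoder preset: ${String(unhandled)}`);
    }
  }
}
```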

Verdict: REQUEST CHANGES
Reasoning: The planHash-never-validated gap is load-bearing — the whole non-retryable error model depends on it, and as-is it's dead. The env-mutation-no-restore breaks the documented "pure function over local paths" contract on multi-chunk workers (which the PR body explicitly targets). Both are pre-merge fixes, not follow-ups, because both will cause silent miscompiles on the path the architecture is being built for.

— Vai

@jrusso1020 force-pushed the feat/producer-distributed-plan branch from a4c24f9 to eff4cf6 on May 14, 2026 00:14
@jrusso1020 force-pushed the 05-13-feat_producer_add_services_distributed_renderchunk.ts branch from c82179f to 1dbd54a on May 14, 2026 00:14
@jrusso1020 (Collaborator, Author) commented:

@vanceingalls — addressed all six items. Summary in roughly your blocker → nit order:

blocker — planHash never validated against planDir bytes ✅

Added recomputePlanHashFromPlanDir(planDir) in freezePlan.ts (re-uses collectPlanAssetShas so the framing is identical to what freezePlan writes; encoder JSON is consumed as raw bytes so round-trip drift on number printing / whitespace can't sneak in). renderChunk now calls it after the structural/version checks and before any browser work — recomputed hash ≠ plan.json.planHash trips PLAN_HASH_MISMATCH. New test "rejects a planDir with a tampered planHash" corrupts plan.json and asserts the typed throw.

important — applyRuntimeEnvSnapshot mutates process.env globally with no restore ✅

Took option (1) — applyRuntimeEnvSnapshot now returns { restore }. Internally it snapshots pre-apply values per touched key; restore() reverts (deleting keys that didn't exist before). renderChunk wraps everything after the apply call in try { ... } finally { envRestore.restore(); } so multi-chunk workers (Cloud Run Job, Temporal activity worker — both PR-body targets) get a clean env between chunks. Also added apply-side RUNTIME_ENV_PREFIXES filtering (covers the poisoned-planDir attack you raised: a planDir with PATH or LD_PRELOAD in its runtimeEnv is now silently dropped). Three new test cases in runtimeEnvSnapshot.test.ts: prefix filter rejects arbitrary keys, restore reverts touched keys only, restore is idempotent.

important — workDir naming collides on retry of same (planDir, chunkIndex) ✅

workDir is now `${outputChunkPath}.work.${process.pid}.${randomBytes(4).toString("hex")}`. Dropped the pre-delete (it was the half of the race that caused the bug — A creates, B rms A's, A errors on missing files). Kept the contract that concurrent writers to outputChunkPath produce undefined bytes — that's still the caller's serialization responsibility — but at least the workDirs no longer race.

important — PLAN_HASH_MISMATCH overloaded for non-hash errors ✅

Split into four distinct codes:

  • MISSING_PLAN_ARTIFACT — file-existence checks for plan.json / meta/*.json / compiled/
  • CHUNK_INDEX_OUT_OF_RANGE — bounds check + missing slice + non-positive frame count
  • PLAN_HASH_MISMATCH — exclusively reserved for the on-disk fingerprint mismatch (which now actually fires per the blocker fix)
  • MISSING_RUNTIME_ENV_SNAPSHOT — see next item

RenderChunkValidationError.code narrowed to a literal union of the four codes above plus BROWSER_GPU_NOT_SOFTWARE and FFMPEG_VERSION_MISMATCH — six in total. Tests updated to assert on code rather than message substrings.
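Roughly the shape of the narrowed union (a sketch; the real class in renderChunk.ts may carry extra fields):

```ts
type RenderChunkValidationCode =
  | "MISSING_PLAN_ARTIFACT"
  | "CHUNK_INDEX_OUT_OF_RANGE"
  | "PLAN_HASH_MISMATCH"
  | "MISSING_RUNTIME_ENV_SNAPSHOT"
  | "BROWSER_GPU_NOT_SOFTWARE"
  | "FFMPEG_VERSION_MISMATCH";

class RenderChunkValidationError extends Error {
  constructor(
    readonly code: RenderChunkValidationCode,
    message: string,
  ) {
    super(message);
    this.name = "RenderChunkValidationError";
  }
}
```

Adapters can now route retry policy per code, e.g. treating CHUNK_INDEX_OUT_OF_RANGE as a caller bug and PLAN_HASH_MISMATCH as plan-storage drift, neither retryable.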

nit — encoder.runtimeEnv ?? {} swallows the missing-snapshot case ✅

Now throws RenderChunkValidationError(MISSING_RUNTIME_ENV_SNAPSHOT, ...) explicitly when the field is absent. Diagnostic message names the re-plan remediation. New test covers it (the corruption-induced planHash mismatch fires first if encoder.json is tampered, but the test accepts either typed code).

nit — inferQualityFromEncoder fragile inverse ✅

Persisted quality in LockedRenderConfig (PR #808 change — additive, flows into planHash via canonical JSON, so cross-version drift is detected as PLAN_HASH_MISMATCH). renderChunk reads encoder.quality directly. inferQualityFromEncoder deleted from shared.ts.

Tests: 13 in renderChunk.test.ts + 3 new in runtimeEnvSnapshot.test.ts, all green. Full stack at 321 pass / 1 fail (the failing test is the pre-existing writeCompiledArtifacts flake on origin/main, unrelated).

@miguel-heygen (Collaborator) previously approved these changes May 14, 2026 and left a comment:

Re-review

All items addressed:

  • planHash validation — recomputePlanHashFromPlanDir added, called before browser work. Reads encoder JSON as raw bytes to avoid round-trip drift. Test confirms tampered hash triggers PLAN_HASH_MISMATCH.
  • env restore — applyRuntimeEnvSnapshot now returns { restore }, called in finally. Prefix filtering added as defense-in-depth. Three new tests cover filtering, restore, and idempotent double-restore.
  • Error code split — MISSING_PLAN_ARTIFACT, CHUNK_INDEX_OUT_OF_RANGE, MISSING_RUNTIME_ENV_SNAPSHOT, PLAN_HASH_MISMATCH are now distinct codes.
  • inferQualityFromEncoder — deleted entirely; quality read from LockedRenderConfig directly.
  • outputResolution — now passed through SyntheticRenderJobInput.
  • audioOutputPath — runtime guard added.

Thorough fixes. LGTM.

@vanceingalls dismissed their stale review May 14, 2026 00:39

Re-reviewed at HEAD 1dbd54a — all 3 findings addressed. New review incoming.

@vanceingalls (Collaborator) previously approved these changes May 14, 2026 and left a comment:

Re-review

All three prior blockers/important findings addressed. Strong implementation work — the planHash validator threading is exactly the shape the contract needs, the {restore} handle reverses the global-state regression cleanly, and the error-code split now lets adapters route retry policies on real semantic distinctions. Plus the workDir collision (important) and runtimeEnv ?? {} swallow (nit) both fixed as a bonus.

Prior findings — status

| # | severity | status | citation |
| --- | --- | --- | --- |
| 1a | blocker | addressed | renderChunk.ts:774-783 — recomputePlanHashFromPlanDir(planDir) recomputes the on-disk fingerprint and compares to plan.planHash. Tampered-hash test at renderChunk.test.ts:397-424 confirms PLAN_HASH_MISMATCH fires on corruption. |
| 1b | blocker | addressed | renderChunk.ts:538-542 — MISSING_PLAN_ARTIFACT, CHUNK_INDEX_OUT_OF_RANGE, MISSING_RUNTIME_ENV_SNAPSHOT are now distinct codes. PLAN_HASH_MISMATCH is reserved for actual content-fingerprint drift (lines 774-783). |
| 2 | important | addressed | runtimeEnvSnapshot.ts:1262-1282 — applyRuntimeEnvSnapshot now returns { restore } after snapshotting pre-apply values. renderChunk.ts:802 captures the handle; renderChunk.ts:1020-1024 invokes envRestore.restore() in finally so error-path cleanup is symmetric. Three new unit tests at runtimeEnvSnapshot.test.ts:1182+ cover the filter, restore, and idempotent double-restore. |
| 3 | important | addressed | renderChunk.ts:831 — workDir now suffixed with ${process.pid}.${randomBytes(4).toString("hex")}. Concurrent invocations on the same (planDir, chunkIndex) no longer race. |
| 4 | nit | addressed | renderChunk.ts:789-794 — runtimeEnv absence now throws MISSING_RUNTIME_ENV_SNAPSHOT with a clear re-plan diagnostic rather than the silent ?? {} fallback. |
| 5 | nit | addressed | shared.ts no longer contains inferQualityFromEncoder; quality is read from LockedRenderConfig directly per Magi's note. |

Net-new strengths

  • renderChunk.ts:786-794 — distinct MISSING_RUNTIME_ENV_SNAPSHOT code with a typeof encoder.runtimeEnv !== "object" guard. Aligns the "missing" and "wrong-type" paths to the same diagnostic.
  • applyRuntimeEnvSnapshot returns the prefix-filtered apply set in previous, so restore() is scoped to keys this call touched — keys the worker host added (e.g. HYPERFRAMES_EXTRACT_CACHE_DIR) stay untouched.

Net-new findings

None gating. The contract is enforced.

Verdict: APPROVE
Reasoning: Every prior finding addressed with citations, plus the workDir collision and silent runtimeEnv-fallback nits I flagged. Tests cover the new code paths (tampered-hash, OOB index, missing artifact, missing runtimeEnv). CI green on required checks; optional regression-shards still running.

— Vai

@jrusso1020 changed the base branch from feat/producer-distributed-plan to graphite-base/809 on May 14, 2026 00:46
@jrusso1020 force-pushed the 05-13-feat_producer_add_services_distributed_renderchunk.ts branch from 1dbd54a to 2fe3dde on May 14, 2026 00:47
@graphite-app (Bot) changed the base branch from graphite-base/809 to main on May 14, 2026 00:47
@graphite-app (Bot) dismissed stale reviews from miguel-heygen and vanceingalls on May 14, 2026 00:47

The base branch was changed.

@jrusso1020 force-pushed the 05-13-feat_producer_add_services_distributed_renderchunk.ts branch from 2fe3dde to 9e470d1 on May 14, 2026 00:47
@miguel-heygen (Collaborator) left a comment:

Re-stamping after rebase. LGTM.

@jrusso1020 force-pushed the 05-13-feat_producer_add_services_distributed_renderchunk.ts branch from 9e470d1 to 1dbd54a on May 14, 2026 01:13
@jrusso1020 force-pushed the 05-13-feat_producer_add_services_distributed_renderchunk.ts branch from 1dbd54a to e55a8d0 on May 14, 2026 01:44
@jrusso1020 (Collaborator, Author) commented May 14, 2026

Merge activity

  • May 14, 1:45 AM UTC: Graphite rebased this pull request as part of a merge.
  • May 14, 2:09 AM UTC: @jrusso1020 merged this pull request with Graphite.

@jrusso1020 merged commit 602ef87 into main on May 14, 2026
39 checks passed
@jrusso1020 deleted the 05-13-feat_producer_add_services_distributed_renderchunk.ts branch on May 14, 2026 02:09
jrusso1020 added a commit that referenced this pull request May 14, 2026
## What

Phase 3 of the distributed rendering plan: the third public primitive. Adds `assemble(planDir, chunkPaths, audioPath, outputPath)` and its supporting types at `packages/producer/src/services/distributed/assemble.ts`. See `DISTRIBUTED-RENDERING-PLAN.md` §11 Phase 3.

## Why

`plan()` (#808) and `renderChunk()` (#809) produce the planDir and the per-chunk outputs respectively. `assemble()` is what every distributed fan-out workflow runs last: it stitches the chunks into the final deliverable. Without it, the planDir → chunks chain stops at a list of files; nothing produces the user-facing mp4/mov/png-sequence.

## How

`assemble()` branches on the planDir's encoder format:

**mp4 / mov**: ffmpeg `-f concat -c copy` over the ordered chunk paths. Each chunk's first frame is an IDR keyframe (PR 3.2 set `lockGopForChunkConcat: true`), so concat-copy round-trips losslessly. The concatenated output is then:
1. Passed through `padOrTrimAudioToVideoFrameCount` (PR 2.7 surface) when `audioPath` is non-null, so audio length is exactly `frameCount / fps` rather than the audio mixer's original-duration output.
2. Muxed with the normalized audio via the engine's `muxVideoWithAudio` (same helper the in-process renderer's `assembleStage` uses).
3. Passed through `applyFaststart` so the `moov` atom moves to the file's start.

When no audio is present, the concat output skips mux and goes straight to `applyFaststart`.
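Illustratively, the concat-copy step amounts to building a concat-demuxer list file and an argv like the following (a sketch; the module's actual `runFfmpeg` plumbing is not reproduced, and paths are assumed to contain no single quotes):

```ts
import { writeFileSync } from "node:fs";

// The concat demuxer reads "file '<path>'" lines in order, then -c copy
// stitches the chunks without re-encoding (valid because every chunk
// starts on an IDR keyframe per PR 3.2's closed-GOP setting).
function concatCopyArgs(chunkPaths: string[], listPath: string, outputPath: string): string[] {
  writeFileSync(listPath, chunkPaths.map((p) => `file '${p}'`).join("\n"));
  return ["-f", "concat", "-safe", "0", "-i", listPath, "-c", "copy", outputPath];
}
```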

**png-sequence**: chunks are directories of `frame_NNNNNN.png` files numbered locally per chunk. `assemble()` merges them with a continuous global index so chunk 0's `frame_000000.png` lands at `frame_000001.png` in the output, chunk 1's first frame becomes `frame_(N+1)`, etc. When `audioPath` is non-null we copy it alongside as `audio.aac` so callers who need to re-mux later have it.

### Validation

Both branches assert `chunkPaths.length === chunks.length` (the value read from `meta/chunks.json`) and that each chunk path exists. A missing or mismatched manifest trips a typed error before any ffmpeg invocation.

### What did NOT change

No engine helpers, no Phase 1 stages, no in-process orchestrator. `assemble()` reuses `muxVideoWithAudio` / `applyFaststart` / `runFfmpeg` / `padOrTrimAudioToVideoFrameCount` exactly as they exist — the in-process `runAssembleStage` is intentionally not called because it operates on a `RenderJob` and emits `updateJobStatus` payloads, neither of which the distributed activity has.

## Test plan

- [x] Unit tests added — `packages/producer/src/services/distributed/assemble.test.ts`. 5 cases:
  - Concat-copies two mp4 chunks and applies faststart (ffprobe asserts codec, frame count, atom order).
  - Muxes audio with `frame-count-derived` duration when `audio.aac` is present (ffprobe asserts audio duration within 50ms of `totalFrames / fps`).
  - Merges png-sequence chunk directories with continuous global numbering (asserts filenames `frame_000001..frame_000007`).
  - Rejects mismatched `chunkPaths.length` vs `chunks.json.length`.
  - Rejects a planDir missing `plan.json`.
- [x] `bun test packages/producer/src/services/distributed/` — 18 pass (PRs 3.1 + 3.2 + 3.3).
- [x] `bun run --filter @hyperframes/producer typecheck` — clean.
- [x] `bunx oxlint` + `bunx oxfmt --check` — clean on changed files.
- [ ] Producer Docker regression harness — pending CI. `executeRenderJob` is unchanged; PSNR baselines should hold.

The mp4 fixture pre-renders test inputs via raw ffmpeg (`testsrc` filter + closed-GOP libx264) rather than going through the Chrome capture pipeline. This isolates concat-copy + mux + faststart from the renderChunk path that PRs 3.1/3.2 already cover, and avoids the chrome-headless-shell smoke-test gating that PR 3.2 needed.

This is PR 3 of a 6-PR Phase 3 stack:

- 3.1 — `services/distributed/plan.ts` (#808)
- 3.2 — `services/distributed/renderChunk.ts` (#809)
- **3.3 (this PR)** — `services/distributed/assemble.ts`
- 3.4 — `planDir` size cap (`PLAN_TOO_LARGE`)
- 3.5 — distributed format banlist (webm + HDR mp4)
- 3.6 — public exports + `@hyperframes/producer/distributed` subpath

🤖 Generated with [Claude Code](https://claude.com/claude-code)
jrusso1020 added a commit that referenced this pull request May 14, 2026
## What

Phase 3 of the distributed rendering plan: §6.4 / §9.3 size cap. Extends `plan()` to measure the produced planDir's total byte size after freeze and throw a typed non-retryable `PlanTooLargeError` (`code === "PLAN_TOO_LARGE"`) when the planDir exceeds 2 GB.

## Why

Distributed chunk workers ship the entire planDir to whatever ephemeral storage they're running on — `/tmp` on AWS Lambda (10 GB), the container filesystem on Cloud Run Jobs, etc. A planDir that doesn't fit can't be rendered. v1.5 lifts this cap via per-chunk video-frame slicing (§12); for now v1 fails fast at plan time so adapters don't waste a fan-out attempt that's guaranteed to OOM.

The 2 GB ceiling specifically targets Lambda's 10 GB `/tmp`: planDir + per-chunk captured frames + ffmpeg's working set all share that budget, and 2 GB leaves ~8 GB for capture/encode at 4K SDR.

## How

- New exports in `services/distributed/plan.ts`:
  - `PLAN_DIR_SIZE_LIMIT_BYTES` — the 2 GB constant.
  - `PLAN_TOO_LARGE` — the non-retryable error code (matches §9.3).
  - `PlanTooLargeError` — typed error class carrying `code`, `sizeBytes`, `limitBytes`, and a message that points adopters at the v1.5 slicing roadmap + the in-process renderer escape hatch.
  - `measurePlanDirBytes(planDir)` — recursive on-disk size walker. Symlinks skipped intentionally. (A minimal sketch follows this list.)
- `DistributedRenderConfig.planDirSizeLimitBytes?: number` — optional override. Defaults to `PLAN_DIR_SIZE_LIMIT_BYTES`. Tests pass a tiny cap (1024 bytes) to exercise the throw path without filling 2 GB of /tmp.
- The check runs in `plan()` AFTER the temp work tree is removed (so `.plan-work/` doesn't double-count) but BEFORE the function returns — adapters that catch the error never see a `PlanResult`.
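A minimal sketch of the walker under the rules above (symlinks skipped, `.plan-work/` already removed); the shipped `measurePlanDirBytes` may differ in detail:

```ts
import { readdirSync, statSync } from "node:fs";
import { join } from "node:path";

export function measurePlanDirBytes(planDir: string): number {
  let total = 0;
  for (const entry of readdirSync(planDir, { withFileTypes: true })) {
    if (entry.isSymbolicLink()) continue; // skipped intentionally
    const abs = join(planDir, entry.name);
    if (entry.isDirectory()) total += measurePlanDirBytes(abs);
    else if (entry.isFile()) total += statSync(abs).size;
  }
  return total;
}
```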

### What did NOT change

`executeRenderJob`, the in-process orchestrator, the `hyperframes render` CLI, producer HTTP routes — all unchanged. Only `plan()` (which is itself opt-in) enforces the cap.

## Test plan

- [x] Unit tests added — `packages/producer/src/services/distributed/planSizeCap.test.ts`. 7 cases:
  - `measurePlanDirBytes` returns 0 for an empty dir, sums recursively, and gracefully ignores broken entries.
  - `PLAN_DIR_SIZE_LIMIT_BYTES` is `2 * 1024 * 1024 * 1024` (§6.4 pin).
  - `PlanTooLargeError` carries the `PLAN_TOO_LARGE` code + `sizeBytes` + `limitBytes` and mentions the v1.5 escape hatch.
  - `plan()` throws `PlanTooLargeError` when configured with a 1024-byte ceiling.
  - `plan()` succeeds when the default 2 GB ceiling is well above the produced planDir.
- [x] `bun test packages/producer/src/services/distributed/` — 25 pass (PRs 3.1 + 3.2 + 3.3 + 3.4).
- [x] `bun run --filter @hyperframes/producer typecheck` — clean.
- [x] `bunx oxlint` + `bunx oxfmt --check` — clean on changed files.
- [ ] Producer Docker regression harness — pending CI. `executeRenderJob` is unchanged; PSNR baselines should hold.

This is PR 4 of a 6-PR Phase 3 stack:

- 3.1 — `services/distributed/plan.ts` (#808)
- 3.2 — `services/distributed/renderChunk.ts` (#809)
- 3.3 — `services/distributed/assemble.ts` (#813)
- **3.4 (this PR)** — `planDir` size cap (`PLAN_TOO_LARGE`)
- 3.5 — distributed format banlist (webm + HDR mp4)
- 3.6 — public exports + `@hyperframes/producer/distributed` subpath

🤖 Generated with [Claude Code](https://claude.com/claude-code)
jrusso1020 added a commit that referenced this pull request May 14, 2026
…4) (#815)

## What

Phase 3 of the distributed rendering plan: §11 PR 3.5 format banlist. Extends `plan()` to refuse two v1-unsupported formats up front with a typed non-retryable `FormatNotSupportedInDistributedError` (`code === "FORMAT_NOT_SUPPORTED_IN_DISTRIBUTED"`).

## Why

Both webm and HDR mp4 are documented as deferred to v1.5 (§7.2 + §12), but until this PR the only signal at the runtime layer is the in-process pipeline silently producing wrong output (chunk concat-copy doesn't round-trip VP9; HDR signaling gets stripped at the chunk boundary). Failing fast at `plan()` time keeps adopters from spending fan-out compute on a render that can't succeed and gives them a typed error code their workflow adapter can route on.

## How

- New exports in `services/distributed/plan.ts`:
  - `FORMAT_NOT_SUPPORTED_IN_DISTRIBUTED` — non-retryable error code matching §11's wording.
  - `FormatNotSupportedInDistributedError` — typed error class with `code`, `format`, and `reason` fields. Message names the rejected format and tells adopters to fall back to the in-process renderer (`executeRenderJob`) which has full format support.
  - `rejectUnsupportedDistributedFormat(config)` — pure helper exported separately so adapters can run the same gate at their input layer (Step Functions input validation, Temporal workflow start) before the activity even runs. (A sketch follows this list.)
- `plan()` calls `rejectUnsupportedDistributedFormat(config)` as the first line of the function — BEFORE `mkdirSync(planDir)` so a banned input never produces a partial planDir.
- Replaced the previous ad-hoc `if (hdrMode === "force-hdr") throw new Error(...)` with the typed error class.
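A sketch of the gate under stated assumptions: the `format` / `hdrMode` field names follow the description above, and the real config type and error messages live in `plan.ts`.

```ts
class FormatNotSupportedInDistributedError extends Error {
  readonly code = "FORMAT_NOT_SUPPORTED_IN_DISTRIBUTED";
  constructor(
    readonly format: string,
    readonly reason: string,
  ) {
    super(
      `format ${format} is not supported in distributed v1: ${reason}; ` +
        `fall back to the in-process renderer (executeRenderJob)`,
    );
  }
}

function rejectUnsupportedDistributedFormat(config: {
  format: "mp4" | "mov" | "png-sequence" | "webm";
  hdrMode: "auto" | "force-sdr" | "force-hdr";
}): void {
  if (config.format === "webm") {
    throw new FormatNotSupportedInDistributedError(
      "webm",
      "chunk concat-copy does not round-trip VP9",
    );
  }
  if (config.hdrMode === "force-hdr") {
    throw new FormatNotSupportedInDistributedError(
      "mp4-hdr",
      "HDR signaling is stripped at the chunk boundary (deferred to v1.5)",
    );
  }
}
```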

### What did NOT change

`executeRenderJob`, the in-process orchestrator, the `hyperframes render` CLI, producer HTTP routes — all unchanged. The in-process renderer continues to accept webm + HDR (its existing functionality).

## Test plan

- [x] Unit tests added — `packages/producer/src/services/distributed/planFormatBanlist.test.ts`. 5 cases:
  - `rejectUnsupportedDistributedFormat` accepts the v1-supported formats (mp4, mov, png-sequence) with both `auto` and `force-sdr` hdrMode.
  - Rejects webm — error has `code === FORMAT_NOT_SUPPORTED_IN_DISTRIBUTED`, `format === "webm"`, message mentions in-process renderer.
  - Rejects HDR mp4 (`hdrMode === "force-hdr"`) — error has `format === "mp4-hdr"`, message mentions HDR.
  - End-to-end via `plan()`: webm throws with no planDir leaking to disk.
  - End-to-end via `plan()`: HDR mp4 throws with no planDir leaking to disk.
- [x] `bun test packages/producer/src/services/distributed/` — 30 pass.
- [x] `bun run --filter @hyperframes/producer typecheck` — clean.
- [x] `bunx oxlint` + `bunx oxfmt --check` — clean on changed files.
- [ ] Producer Docker regression harness — pending CI. `executeRenderJob` is unchanged; PSNR baselines should hold.

This is PR 5 of a 6-PR Phase 3 stack:

- 3.1 — `services/distributed/plan.ts` (#808)
- 3.2 — `services/distributed/renderChunk.ts` (#809)
- 3.3 — `services/distributed/assemble.ts` (#813)
- 3.4 — `planDir` size cap (`PLAN_TOO_LARGE`) (#814)
- **3.5 (this PR)** — distributed format banlist (webm + HDR mp4)
- 3.6 — public exports + `@hyperframes/producer/distributed` subpath

🤖 Generated with [Claude Code](https://claude.com/claude-code)