feat(producer): add services/distributed/renderChunk.ts #809
Force-pushed c683dbc → c82179f
Force-pushed 83a0963 → a4c24f9
miguel-heygen
left a comment
Review: renderChunk.ts — Distributed Chunk Rendering
The #1 concern — frame offset handling — is correct. The critical path through `discardWarmupCapture` → `runCaptureStage` uses absolute frame indices for time calculation and relative indices for file naming. This is the proper fix for the original `frameStartOffset` bug.
Resource cleanup is thorough (try/finally, session null-guard, work dir preserved on failure for debugging). Typed validation errors with string codes enable workflow adapters to mark failures as non-retryable. Validations run before Chrome init — correct fail-fast ordering.
Minor nits
- `inferQualityFromEncoder` incomplete (shared.ts): maps `prores-software` and `png-sequence` to `"high"`, everything else to `"standard"`. If new encoder presets are added (e.g., AV1), this silently defaults. Consider an exhaustive check or a logged warning for unrecognized encoders.
- `outputResolution` omitted in `buildSyntheticRenderJob` (~line 407): defaults to `undefined`. If the plan was created with an explicit output resolution, chunks render at source dimensions. Verify that `plan.ts` bakes resolution into `width`/`height` in `plan.json`.
- `audioOutputPath` made optional in `encodeStage.ts`: the png path guards correctly, but the non-png path still receives `audioOutputPath` downstream. Low risk since `renderChunk` always passes `hasAudio: false`, but a defensive check would prevent future caller issues.
None of these are blockers. LGTM.
vanceingalls
left a comment
Phase 3 Activity B — renderChunk composes a single chunk's capture+encode against the planDir contract. The byte-identical-retry framing is right, the SwiftShader-via-canvas workaround for chrome-headless-shell is a clean fix, and the new frameRange field on captureStage threads cleanly without breaking the in-process branch. A few real gaps below.
Strengths
- `renderChunk.ts:538-572` — `readWebGlVendorInfoFromCanvas` is the right answer to chrome-headless-shell's empty `chrome://gpu` document. The fallback to `gl.VENDOR`/`gl.RENDERER` when the extension is missing keeps it robust on stripped-down Chrome builds.
- `captureStage.ts:1148-1166` — the absolute-frame-index time + zero-normalized file name split is the contract that lets in-process renders and chunk renders converge on byte-identical pixels at a given absolute frame. Cited at `captureStage.ts:1162` (`time = (absoluteIdx * fps.den) / fps.num`); see the sketch after this list.
- `shared.ts:925-942` — caching `ffmpeg -version` after the first read saves 20-50ms per chunk in long-running workers (Cloud Run Job, Temporal activity worker). Small but exactly the right kind of caching: an immutable fact, hashed into planHash, so cache invalidation is moot.
- `renderChunk.test.ts:281-345` — the soft-skip pattern for chrome-headless-shell-broken hosts (the `chrome:\/\/gpu|BROWSER_GPU_NOT_SOFTWARE|SwiftShader|...` regex) keeps local dev from grinding on a host-Chrome bug while still asserting the contract inside the Docker harness. Pragmatic.
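A minimal sketch of that time/name split, using a hypothetical `FrameRange` shape; the real field lives on `CaptureStageInput`, and the cited expression is the one at `captureStage.ts:1162`:

```ts
// Hypothetical illustration of the frameRange contract: virtual time comes from
// the ABSOLUTE composition frame index, while capture file names are normalized
// to start at zero so the encoder needs no -start_number override.
interface FrameRange {
  startFrame: number; // absolute, inclusive
  endFrame: number;   // absolute, exclusive
}

function planFrames(range: FrameRange, fps: { num: number; den: number }) {
  const frames: { timeSeconds: number; fileName: string }[] = [];
  for (let absoluteIdx = range.startFrame; absoluteIdx < range.endFrame; absoluteIdx++) {
    const relativeIdx = absoluteIdx - range.startFrame;
    frames.push({
      // matches the cited expression: time = (absoluteIdx * fps.den) / fps.num
      timeSeconds: (absoluteIdx * fps.den) / fps.num,
      fileName: `frame_${String(relativeIdx).padStart(6, "0")}.png`,
    });
  }
  return frames;
}
```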
Findings
blocker — renderChunk never actually validates planHash against the planDir bytes
plan.json.planHash is read into plan.planHash (renderChunk.ts:626) and copied verbatim into the perf sidecar at line 863, but renderChunk NEVER recomputes the hash from the on-disk contents to verify the worker is reading the same bytes the controller wrote. The PLAN_HASH_MISMATCH code is reused for two semantically different errors: "missing required file" (lines 620, 632) and "missing slice" (line 640). The actual contract — "this planDir's content-addressed identity matches the value the controller computed" — is never checked.
Failure mode: a corrupted planDir (truncated meta/encoder.json, partial S3 download, malicious tampering of compiled/index.html) renders silently with the wrong bytes. Every chunk reports sha256 based on its own output — but those outputs are derived from a planDir the worker never validated. The whole point of planHash (plan.ts:771 computes it, freezePlan.ts:266 returns it) is to catch exactly this.
Two options to fix:
- Recompute planHash on the worker at boot (cost: a re-walk of `<planDir>/` and a fresh sha256 of every asset). Throw `PLAN_HASH_MISMATCH` when the recomputed hash != `plan.planHash`. This is what the design doc §9.3 implies; a sketch follows below.
- If a full re-hash is too expensive per chunk in a long-running worker, recompute once and memoize against the planDir path so subsequent chunks on the same planDir skip the work.
Either way, the gap matters because the whole non-retryable error model leans on planHash being load-bearing. As-is, the code says it's load-bearing but doesn't actually load-bear.
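A minimal sketch of the recompute, assuming the plan hash is a sha256 over sorted relative paths plus file bytes; the real digest must mirror exactly what `plan.ts` hashes (file set, ordering, any exclusions such as `plan.json` itself), otherwise the comparison is meaningless:

```ts
import { createHash } from "node:crypto";
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Sketch only: walk planDir deterministically and fold relative paths + bytes
// into one sha256. Symlinks and non-file entries are skipped in this sketch.
function recomputePlanHashFromPlanDir(planDir: string): string {
  const hash = createHash("sha256");
  const walk = (dir: string, prefix = ""): void => {
    const entries = readdirSync(dir, { withFileTypes: true }).sort((a, b) =>
      a.name.localeCompare(b.name),
    );
    for (const entry of entries) {
      const rel = join(prefix, entry.name);
      const full = join(dir, entry.name);
      if (entry.isDirectory()) walk(full, rel);
      else if (entry.isFile()) {
        hash.update(rel);
        hash.update(readFileSync(full));
      }
    }
  };
  walk(planDir);
  return hash.digest("hex");
}
```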
important — applyRuntimeEnvSnapshot mutates process.env globally with no restore (runtimeEnvSnapshot.ts:1068-1078)
applyRuntimeEnvSnapshot(encoder.runtimeEnv ?? {}) overwrites process.env keys and never restores them. For a worker that runs ONE chunk per process (Lambda) this is fine. For a worker that runs MANY chunks per process (Cloud Run Job, Temporal activity worker — both called out as target deployments in the PR body), this pollutes env across chunks: chunk N's env sticks around when chunk N+1 boots, and if chunk N+1's planDir has a different (or absent) RENDER_MODE_SCRIPT key, the leaked value wins.
The RUNTIME_ENV_PREFIXES filter narrows the blast radius but doesn't eliminate the problem — keys within the prefix range can still cross-contaminate. Two fixes:
- Snapshot the pre-apply values for keys matching `RUNTIME_ENV_PREFIXES`, restore them in `renderChunk`'s `finally` block. Same lifecycle as the captureSession close: chunk-scoped.
- Document the hard contract: "one chunk per process; multi-chunk processes must unset between chunks." Then enforce it via a `processChunkCounter` that throws on the second invocation.
I'd take (1) — it's the contract callers actually expect from a "pure function over local paths" (which the module docstring at line 408 explicitly claims). Mutating global process state silently is the opposite of pure.
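A minimal sketch of option (1), the chunk-scoped apply/restore shape. The function name matches the PR; the `{ restore }` return is the suggested change, not the code as reviewed:

```ts
// Sketch: remember pre-apply values and hand back a restore handle that
// renderChunk can invoke in its finally block, next to the captureSession close.
function applyRuntimeEnvSnapshot(snapshot: Record<string, string>) {
  const previous = new Map<string, string | undefined>();
  for (const [key, value] of Object.entries(snapshot)) {
    previous.set(key, process.env[key]);
    process.env[key] = value;
  }
  let restored = false;
  return {
    restore(): void {
      if (restored) return; // idempotent double-restore
      restored = true;
      for (const [key, value] of previous) {
        if (value === undefined) delete process.env[key];
        else process.env[key] = value;
      }
    },
  };
}

// Usage inside renderChunk (sketch):
// const envRestore = applyRuntimeEnvSnapshot(encoder.runtimeEnv);
// try { /* capture + encode */ } finally { envRestore.restore(); }
```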
important — renderChunk.ts:705 work directory naming collides on retry of the same (planDir, chunkIndex)
`` const workDir = `${outputChunkPath}.work`; `` — and renderChunk.ts:706-707:

```ts
if (existsSync(workDir)) rmSync(workDir, { recursive: true, force: true });
mkdirSync(workDir, { recursive: true });
```

The blanket pre-delete is fine for sequential retry — but for two concurrent invocations on the SAME (planDir, chunkIndex) (e.g., a Temporal scheduler that double-fires due to heartbeat skew), they'd race: A creates work, B rms A's work mid-render, A errors on missing files. The byte-identical-retry contract assumes serial retries on the same path; double-fire breaks it.

Either rename to include a unique suffix (`${outputChunkPath}.work.${process.pid}.${randomBytes(4).toString("hex")}`) or document "callers must serialize retries on the same outputChunkPath." The same applies to `outputChunkPath` itself: if A and B both write to it concurrently, the resulting bytes are undefined.
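A minimal sketch of the unique-suffix option, with names chosen here for illustration:

```ts
import { randomBytes } from "node:crypto";
import { mkdirSync, rmSync } from "node:fs";

// Sketch: give each invocation its own work dir so a double-fired retry on the
// same (planDir, chunkIndex) cannot delete another invocation's in-flight frames.
// Whether to clean up on failure stays the caller's decision (the PR keeps the
// work dir around for debugging).
function createUniqueWorkDir(outputChunkPath: string): { workDir: string; cleanup: () => void } {
  const suffix = `${process.pid}.${randomBytes(4).toString("hex")}`;
  const workDir = `${outputChunkPath}.work.${suffix}`;
  mkdirSync(workDir, { recursive: true });
  return {
    workDir,
    cleanup: () => rmSync(workDir, { recursive: true, force: true }),
  };
}
```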
important — RenderChunkValidationError overloads PLAN_HASH_MISMATCH for non-hash errors (renderChunk.ts:620, 632, 640, 697)
The code raises PLAN_HASH_MISMATCH for four distinct conditions:
- missing required artifact (line 620)
- chunkIndex out of range (line 632)
- chunk slice undefined (line 640)
- compiled/ directory missing (line 697)
None of these are actually plan-hash mismatches — they're shape/structural errors with the planDir. Adapters routing retry policies off error.code === "PLAN_HASH_MISMATCH" will lump cross-version drift (the real PLAN_HASH_MISMATCH case) together with "user passed a chunkIndex bigger than the chunk count" (a caller bug). Distinct error codes:
- `MISSING_PLAN_ARTIFACT` for the file-existence checks
- `CHUNK_INDEX_OUT_OF_RANGE` for the bounds check
- `PLAN_HASH_MISMATCH` reserved for the actual content-hash drift case (which today doesn't fire at all — see the blocker above)
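A minimal sketch of that split as a discriminated string union, so adapters can route retry policy on `error.code` alone. The code names follow the review; the class shape is illustrative:

```ts
// Illustrative only: distinct, typed, non-retryable codes instead of one
// overloaded PLAN_HASH_MISMATCH.
type RenderChunkErrorCode =
  | "MISSING_PLAN_ARTIFACT"
  | "CHUNK_INDEX_OUT_OF_RANGE"
  | "PLAN_HASH_MISMATCH"
  | "BROWSER_GPU_NOT_SOFTWARE"
  | "FFMPEG_VERSION_MISMATCH";

class RenderChunkValidationError extends Error {
  constructor(
    readonly code: RenderChunkErrorCode,
    message: string,
  ) {
    super(message);
    this.name = "RenderChunkValidationError";
  }
}

// Adapter side (e.g., a Temporal activity wrapper): every code above is a
// non-retryable caller/contract failure; anything else falls back to the
// default retry policy.
function isNonRetryable(err: unknown): boolean {
  return err instanceof RenderChunkValidationError;
}
```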
nit — encoder.runtimeEnv ?? {} swallows the case where the controller forgot to snapshot (renderChunk.ts:671)
A planDir produced before snapshotRuntimeEnv landed (or by a buggy controller) has no runtimeEnv key — the chunk worker silently applies an empty snapshot and proceeds, while the controller's render captured a different set of env vars. The chunk renders with default env → pixels diverge → the planHash mismatch is detected (if you fix the blocker above) but the diagnostic message is confusing. Reject explicitly: if (!encoder.runtimeEnv) throw ... ("planDir missing runtimeEnv snapshot — re-plan with current producer").
nit — inferQualityFromEncoder (shared.ts:1040-1044) is a fragile inverse
```ts
if (encoder === "prores-software" || encoder === "png-sequence") return "high";
return "standard";
```

If a future encoder option (`libx264-software-fast`, `prores-422-hq`) lands and the table in plan.ts:574-577 adds a new row, this function silently routes it to "standard". Either persist the original quality enum in `LockedRenderConfig` (which is content-addressed anyway — adding a field would bump planHash) or make the inverse exhaustive with a typed default-handling branch.
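A minimal sketch of the exhaustive-inverse option. Only `prores-software` and `png-sequence` come from the review; the other union member and the function shape are illustrative:

```ts
// Illustrative encoder union; the real one lives with LockedRenderConfig.
type LockedEncoder = "libx264-software" | "prores-software" | "png-sequence";

function inferQualityFromEncoder(encoder: LockedEncoder): "high" | "standard" {
  switch (encoder) {
    case "prores-software":
    case "png-sequence":
      return "high";
    case "libx264-software":
      return "standard";
    default: {
      // Exhaustiveness check: adding a LockedEncoder member without a case
      // here becomes a compile error instead of a silent "standard".
      const unhandled: never = encoder;
      throw new Error(`Unrecognized encoder preset: ${String(unhandled)}`);
    }
  }
}
```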
Verdict: REQUEST CHANGES
Reasoning: The planHash-never-validated gap is load-bearing — the whole non-retryable error model depends on it, and as-is it's dead. The env-mutation-no-restore breaks the documented "pure function over local paths" contract on multi-chunk workers (which the PR body explicitly targets). Both are pre-merge fixes, not follow-ups, because both will cause silent miscompiles on the path the architecture is being built for.
— Vai
Force-pushed a4c24f9 → eff4cf6
Force-pushed c82179f → 1dbd54a
@vanceingalls — addressed all six items. Summary in roughly your blocker → nit order: blocker — planHash never validated against planDir bytes ✅ Added important —
miguel-heygen
left a comment
Re-review
All items addressed:
- planHash validation — `recomputePlanHashFromPlanDir` added, called before browser work. Reads encoder JSON as raw bytes to avoid round-trip drift. Test confirms a tampered hash triggers `PLAN_HASH_MISMATCH`.
- env restore — `applyRuntimeEnvSnapshot` now returns `{ restore }`, called in `finally`. Prefix filtering added as defense-in-depth. Three new tests cover filtering, restore, and idempotent double-restore.
- Error code split — `MISSING_PLAN_ARTIFACT`, `CHUNK_INDEX_OUT_OF_RANGE`, `MISSING_RUNTIME_ENV_SNAPSHOT`, `PLAN_HASH_MISMATCH` are now distinct codes.
- `inferQualityFromEncoder` — deleted entirely; quality read from `LockedRenderConfig` directly.
- `outputResolution` — now passed through `SyntheticRenderJobInput`.
- `audioOutputPath` — runtime guard added.
Thorough fixes. LGTM.
Re-reviewed at HEAD 1dbd54a — all 3 findings addressed. New review incoming.
vanceingalls
left a comment
Re-review
All three prior blockers/important findings addressed. Strong implementation work — the planHash validator threading is exactly the shape the contract needs, the {restore} handle reverses the global-state regression cleanly, and the error-code split now lets adapters route retry policies on real semantic distinctions. Plus the workDir collision (important) and runtimeEnv ?? {} swallow (nit) both fixed as a bonus.
Prior findings — status
| # | severity | status | citation |
|---|---|---|---|
| 1a | blocker | addressed | renderChunk.ts:774-783 — recomputePlanHashFromPlanDir(planDir) recomputes the on-disk fingerprint and compares to plan.planHash. Tampered-hash test at renderChunk.test.ts:397-424 confirms PLAN_HASH_MISMATCH fires on corruption. |
| 1b | blocker | addressed | renderChunk.ts:538-542 — MISSING_PLAN_ARTIFACT, CHUNK_INDEX_OUT_OF_RANGE, MISSING_RUNTIME_ENV_SNAPSHOT are now distinct codes. PLAN_HASH_MISMATCH is reserved for actual content-fingerprint drift (lines 774-783). |
| 2 | important | addressed | runtimeEnvSnapshot.ts:1262-1282 — applyRuntimeEnvSnapshot now returns { restore } after snapshotting pre-apply values. renderChunk.ts:802 captures the handle; renderChunk.ts:1020-1024 invokes envRestore.restore() in finally so error-path cleanup is symmetric. Three new unit tests at runtimeEnvSnapshot.test.ts:1182+ cover the filter, restore, and idempotent double-restore. |
| 3 | important | addressed | renderChunk.ts:831 — workDir now suffixed with ${process.pid}.${randomBytes(4).toString("hex")}. Concurrent invocations on the same (planDir, chunkIndex) no longer race. |
| 4 | nit | addressed | renderChunk.ts:789-794 — runtimeEnv absence now throws MISSING_RUNTIME_ENV_SNAPSHOT with a clear re-plan diagnostic rather than the silent ?? {} fallback. |
| 5 | nit | addressed | shared.ts no longer contains inferQualityFromEncoder; quality is read from LockedRenderConfig directly per Magi's note. |
Net-new strengths
- `renderChunk.ts:786-794` — distinct `MISSING_RUNTIME_ENV_SNAPSHOT` code with a `typeof encoder.runtimeEnv !== "object"` guard. Aligns the "missing" and "wrong-type" paths to the same diagnostic.
- `applyRuntimeEnvSnapshot` returns the prefix-filtered apply set in `previous`, so `restore()` is scoped to keys this call touched — keys the worker host added (e.g. `HYPERFRAMES_EXTRACT_CACHE_DIR`) stay untouched.
Net-new findings
None gating. The contract is enforced.
Verdict: APPROVE
Reasoning: Every prior finding addressed with citations, plus the workDir collision and silent runtimeEnv-fallback nits I flagged. Tests cover the new code paths (tampered-hash, OOB index, missing artifact, missing runtimeEnv). CI green on required checks; optional regression-shards still running.
— Vai
Force-pushed 1dbd54a → 2fe3dde
The base branch was changed.
Force-pushed 2fe3dde → 9e470d1
miguel-heygen
left a comment
Re-stamping after rebase. LGTM.
Force-pushed 9e470d1 → 1dbd54a
Force-pushed 1dbd54a → e55a8d0
Merge activity
## What

Phase 3 of the distributed rendering plan: the third public primitive. Adds `assemble(planDir, chunkPaths, audioPath, outputPath)` and its supporting types at `packages/producer/src/services/distributed/assemble.ts`. See `DISTRIBUTED-RENDERING-PLAN.md` §11 Phase 3.

## Why

`plan()` (#808) and `renderChunk()` (#809) produce the planDir and the per-chunk outputs respectively. `assemble()` is what every distributed fan-out workflow runs last: it stitches the chunks into the final deliverable. Without it, the planDir → chunks chain stops at a list of files; nothing produces the user-facing mp4/mov/png-sequence.

## How

`assemble()` branches on the planDir's encoder format:

**mp4 / mov**: ffmpeg `-f concat -c copy` over the ordered chunk paths. Each chunk's first frame is an IDR keyframe (PR 3.2 set `lockGopForChunkConcat: true`), so concat-copy round-trips losslessly. The concatenated output is then:

1. Passed through `padOrTrimAudioToVideoFrameCount` (PR 2.7 surface) when `audioPath` is non-null, so audio length is exactly `frameCount / fps` rather than the audio mixer's original-duration output.
2. Muxed with the normalized audio via the engine's `muxVideoWithAudio` (same helper the in-process renderer's `assembleStage` uses).
3. Passed through `applyFaststart` so the `moov` atom moves to the file's start.

When no audio is present, the concat output skips mux and goes straight to `applyFaststart`.

**png-sequence**: chunks are directories of `frame_NNNNNN.png` files numbered locally per chunk. `assemble()` merges them with a continuous global index so chunk 0's `frame_000000.png` lands at `frame_000001.png` in the output, chunk 1's first frame becomes `frame_(N+1)`, etc. When `audioPath` is non-null we copy it alongside as `audio.aac` so callers who need to re-mux later have it.

### Validation

Both branches assert `chunkPaths.length === chunks.length` (the value read from `meta/chunks.json`) and that each chunk path exists. A missing or mismatched manifest trips a typed error before any ffmpeg invocation.

### What did NOT change

No engine helpers, no Phase 1 stages, no in-process orchestrator. `assemble()` reuses `muxVideoWithAudio` / `applyFaststart` / `runFfmpeg` / `padOrTrimAudioToVideoFrameCount` exactly as they exist — the in-process `runAssembleStage` is intentionally not called because it operates on a `RenderJob` and emits `updateJobStatus` payloads, neither of which the distributed activity has.

## Test plan

- [x] Unit tests added — `packages/producer/src/services/distributed/assemble.test.ts`. 5 cases:
  - Concat-copies two mp4 chunks and applies faststart (ffprobe asserts codec, frame count, atom order).
  - Muxes audio with `frame-count-derived` duration when `audio.aac` is present (ffprobe asserts audio duration within 50ms of `totalFrames / fps`).
  - Merges png-sequence chunk directories with continuous global numbering (asserts filenames `frame_000001..frame_000007`).
  - Rejects mismatched `chunkPaths.length` vs `chunks.json.length`.
  - Rejects a planDir missing `plan.json`.
- [x] `bun test packages/producer/src/services/distributed/` — 18 pass (PRs 3.1 + 3.2 + 3.3).
- [x] `bun run --filter @hyperframes/producer typecheck` — clean.
- [x] `bunx oxlint` + `bunx oxfmt --check` — clean on changed files.
- [ ] Producer Docker regression harness — pending CI. `executeRenderJob` is unchanged; PSNR baselines should hold.

The mp4 fixture pre-renders test inputs via raw ffmpeg (`testsrc` filter + closed-GOP libx264) rather than going through the Chrome capture pipeline. This isolates concat-copy + mux + faststart from the renderChunk path that PRs 3.1/3.2 already cover, and avoids the chrome-headless-shell smoke-test gating that PR 3.2 needed.

This is PR 3 of a 6-PR Phase 3 stack:

- 3.1 — `services/distributed/plan.ts` (#808)
- 3.2 — `services/distributed/renderChunk.ts` (#809)
- **3.3 (this PR)** — `services/distributed/assemble.ts`
- 3.4 — `planDir` size cap (`PLAN_TOO_LARGE`)
- 3.5 — distributed format banlist (webm + HDR mp4)
- 3.6 — public exports + `@hyperframes/producer/distributed` subpath

🤖 Generated with [Claude Code](https://claude.com/claude-code)
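A minimal sketch of the mp4/mov branch's ffmpeg shape, shown with plain `child_process` calls rather than the engine's `runFfmpeg`/`applyFaststart` helpers (which the real `assemble()` uses); the flags are standard ffmpeg concat-demuxer usage, not quoted from the PR:

```ts
import { execFileSync } from "node:child_process";
import { writeFileSync } from "node:fs";

// Sketch: concat-copy the ordered, closed-GOP chunks and move the moov atom to
// the front. The real code runs faststart as a separate applyFaststart pass.
function concatCopyChunks(chunkPaths: string[], outputPath: string): void {
  const listPath = `${outputPath}.concat.txt`;
  // concat demuxer list: one "file '<path>'" line per chunk, in render order
  writeFileSync(listPath, chunkPaths.map((p) => `file '${p}'`).join("\n"));
  execFileSync("ffmpeg", [
    "-y",
    "-f", "concat",
    "-safe", "0",
    "-i", listPath,
    "-c", "copy", // lossless: chunk GOPs are closed, first frame is an IDR
    "-movflags", "+faststart",
    outputPath,
  ]);
}
```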
## What

Phase 3 of the distributed rendering plan: §6.4 / §9.3 size cap. Extends `plan()` to measure the produced planDir's total byte size after freeze and throw a typed non-retryable `PlanTooLargeError` (`code === "PLAN_TOO_LARGE"`) when the planDir exceeds 2 GB.

## Why

Distributed chunk workers ship the entire planDir to whatever ephemeral storage they're running on — `/tmp` on AWS Lambda (10 GB), the container filesystem on Cloud Run Jobs, etc. A planDir that doesn't fit can't be rendered. v1.5 lifts this cap via per-chunk video-frame slicing (§12); for now v1 fails fast at plan time so adapters don't waste a fan-out attempt that's guaranteed to OOM. The 2 GB ceiling specifically targets Lambda's 10 GB `/tmp`: planDir + per-chunk captured frames + ffmpeg's working set all share that budget, and 2 GB leaves ~8 GB for capture/encode at 4K SDR.

## How

- New exports in `services/distributed/plan.ts`:
  - `PLAN_DIR_SIZE_LIMIT_BYTES` — the 2 GB constant.
  - `PLAN_TOO_LARGE` — the non-retryable error code (matches §9.3).
  - `PlanTooLargeError` — typed error class carrying `code`, `sizeBytes`, `limitBytes`, and a message that points adopters at the v1.5 slicing roadmap + the in-process renderer escape hatch.
  - `measurePlanDirBytes(planDir)` — recursive on-disk size walker. Symlinks skipped intentionally.
- `DistributedRenderConfig.planDirSizeLimitBytes?: number` — optional override. Defaults to `PLAN_DIR_SIZE_LIMIT_BYTES`. Tests pass a tiny cap (1024 bytes) to exercise the throw path without filling 2 GB of /tmp.
- The check runs in `plan()` AFTER the temp work tree is removed (so `.plan-work/` doesn't double-count) but BEFORE the function returns — adapters that catch the error never see a `PlanResult`.

### What did NOT change

`executeRenderJob`, the in-process orchestrator, the `hyperframes render` CLI, producer HTTP routes — all unchanged. Only `plan()` (which is itself opt-in) enforces the cap.

## Test plan

- [x] Unit tests added — `packages/producer/src/services/distributed/planSizeCap.test.ts`. 7 cases:
  - `measurePlanDirBytes` returns 0 for an empty dir, sums recursively, and gracefully ignores broken entries.
  - `PLAN_DIR_SIZE_LIMIT_BYTES` is `2 * 1024 * 1024 * 1024` (§6.4 pin).
  - `PlanTooLargeError` carries the `PLAN_TOO_LARGE` code + `sizeBytes` + `limitBytes` and mentions the v1.5 escape hatch.
  - `plan()` throws `PlanTooLargeError` when configured with a 1024-byte ceiling.
  - `plan()` succeeds when the default 2 GB ceiling is well above the produced planDir.
- [x] `bun test packages/producer/src/services/distributed/` — 25 pass (PRs 3.1 + 3.2 + 3.3 + 3.4).
- [x] `bun run --filter @hyperframes/producer typecheck` — clean.
- [x] `bunx oxlint` + `bunx oxfmt --check` — clean on changed files.
- [ ] Producer Docker regression harness — pending CI. `executeRenderJob` is unchanged; PSNR baselines should hold.

This is PR 4 of a 6-PR Phase 3 stack:

- 3.1 — `services/distributed/plan.ts` (#808)
- 3.2 — `services/distributed/renderChunk.ts` (#809)
- 3.3 — `services/distributed/assemble.ts` (#813)
- **3.4 (this PR)** — `planDir` size cap (`PLAN_TOO_LARGE`)
- 3.5 — distributed format banlist (webm + HDR mp4)
- 3.6 — public exports + `@hyperframes/producer/distributed` subpath

🤖 Generated with [Claude Code](https://claude.com/claude-code)
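A minimal sketch of the walker and the enforcement point described above; `measurePlanDirBytes` is named in the PR, while the error-construction details here are illustrative:

```ts
import { readdirSync, statSync } from "node:fs";
import { join } from "node:path";

// Sketch: sum regular-file bytes under planDir, skipping symlinks, so the total
// reflects what a chunk worker would actually copy to its ephemeral storage.
function measurePlanDirBytes(planDir: string): number {
  let total = 0;
  for (const entry of readdirSync(planDir, { withFileTypes: true })) {
    if (entry.isSymbolicLink()) continue; // symlinks skipped intentionally
    const full = join(planDir, entry.name);
    if (entry.isDirectory()) total += measurePlanDirBytes(full);
    else if (entry.isFile()) total += statSync(full).size;
  }
  return total;
}

// Enforcement sketch inside plan(), after the temp work tree is removed:
// const limit = config.planDirSizeLimitBytes ?? PLAN_DIR_SIZE_LIMIT_BYTES;
// const sizeBytes = measurePlanDirBytes(planDir);
// if (sizeBytes > limit) throw new PlanTooLargeError(sizeBytes, limit);
```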
…4) (#815)

## What

Phase 3 of the distributed rendering plan: §11 PR 3.5 format banlist. Extends `plan()` to refuse two v1-unsupported formats up front with a typed non-retryable `FormatNotSupportedInDistributedError` (`code === "FORMAT_NOT_SUPPORTED_IN_DISTRIBUTED"`).

## Why

Both webm and HDR mp4 are documented as deferred to v1.5 (§7.2 + §12), but until this PR the only signal at the runtime layer is the distributed pipeline silently producing wrong output (chunk concat-copy doesn't round-trip VP9; HDR signaling gets stripped at the chunk boundary). Failing fast at `plan()` time keeps adopters from spending fan-out compute on a render that can't succeed and gives them a typed error code their workflow adapter can route on.

## How

- New exports in `services/distributed/plan.ts`:
  - `FORMAT_NOT_SUPPORTED_IN_DISTRIBUTED` — non-retryable error code matching §11's wording.
  - `FormatNotSupportedInDistributedError` — typed error class with `code`, `format`, and `reason` fields. Message names the rejected format and tells adopters to fall back to the in-process renderer (`executeRenderJob`) which has full format support.
  - `rejectUnsupportedDistributedFormat(config)` — pure helper exported separately so adapters can run the same gate at their input layer (Step Functions input validation, Temporal workflow start) before the activity even runs.
- `plan()` calls `rejectUnsupportedDistributedFormat(config)` as the first line of the function — BEFORE `mkdirSync(planDir)` so a banned input never produces a partial planDir.
- Replaced the previous ad-hoc `if (hdrMode === "force-hdr") throw new Error(...)` with the typed error class.

### What did NOT change

`executeRenderJob`, the in-process orchestrator, the `hyperframes render` CLI, producer HTTP routes — all unchanged. The in-process renderer continues to accept webm + HDR (its existing functionality).

## Test plan

- [x] Unit tests added — `packages/producer/src/services/distributed/planFormatBanlist.test.ts`. 5 cases:
  - `rejectUnsupportedDistributedFormat` accepts the v1-supported formats (mp4, mov, png-sequence) with both `auto` and `force-sdr` hdrMode.
  - Rejects webm — error has `code === FORMAT_NOT_SUPPORTED_IN_DISTRIBUTED`, `format === "webm"`, message mentions in-process renderer.
  - Rejects HDR mp4 (`hdrMode === "force-hdr"`) — error has `format === "mp4-hdr"`, message mentions HDR.
  - End-to-end via `plan()`: webm throws with no planDir leaking to disk.
  - End-to-end via `plan()`: HDR mp4 throws with no planDir leaking to disk.
- [x] `bun test packages/producer/src/services/distributed/` — 30 pass.
- [x] `bun run --filter @hyperframes/producer typecheck` — clean.
- [x] `bunx oxlint` + `bunx oxfmt --check` — clean on changed files.
- [ ] Producer Docker regression harness — pending CI. `executeRenderJob` is unchanged; PSNR baselines should hold.

This is PR 5 of a 6-PR Phase 3 stack:

- 3.1 — `services/distributed/plan.ts` (#808)
- 3.2 — `services/distributed/renderChunk.ts` (#809)
- 3.3 — `services/distributed/assemble.ts` (#813)
- 3.4 — `planDir` size cap (`PLAN_TOO_LARGE`) (#814)
- **3.5 (this PR)** — distributed format banlist (webm + HDR mp4)
- 3.6 — public exports + `@hyperframes/producer/distributed` subpath

🤖 Generated with [Claude Code](https://claude.com/claude-code)
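A minimal sketch of the gate, with an illustrative config shape; the export names and the two rejection reasons follow the PR description:

```ts
// Illustrative config shape for the sketch; the real DistributedRenderConfig
// lives in services/distributed/plan.ts.
type DistributedFormat = "mp4" | "mov" | "png-sequence" | "webm";

interface DistributedRenderConfigLike {
  format: DistributedFormat;
  hdrMode?: "auto" | "force-sdr" | "force-hdr";
}

const FORMAT_NOT_SUPPORTED_IN_DISTRIBUTED = "FORMAT_NOT_SUPPORTED_IN_DISTRIBUTED" as const;

class FormatNotSupportedInDistributedError extends Error {
  readonly code = FORMAT_NOT_SUPPORTED_IN_DISTRIBUTED;
  constructor(readonly format: string, readonly reason: string) {
    super(`${format} is not supported by the distributed pipeline (${reason}); fall back to executeRenderJob.`);
  }
}

// Pure helper: adapters can run the same check at their input layer before
// spending any fan-out compute.
function rejectUnsupportedDistributedFormat(config: DistributedRenderConfigLike): void {
  if (config.format === "webm") {
    throw new FormatNotSupportedInDistributedError("webm", "chunk concat-copy does not round-trip VP9");
  }
  if (config.hdrMode === "force-hdr") {
    throw new FormatNotSupportedInDistributedError("mp4-hdr", "HDR signaling is stripped at the chunk boundary");
  }
}
```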

## What

Phase 3 of the distributed rendering plan: the second public primitive. Adds `renderChunk(planDir, chunkIndex, outputChunkPath)` and its supporting types as `packages/producer/src/services/distributed/renderChunk.ts`. See `DISTRIBUTED-RENDERING-PLAN.md` §11 Phase 3.

## Why
`plan()` (PR #808) produces the planDir. `renderChunk()` is what every chunk worker actually runs — Temporal activity, Lambda Step Functions Map task, Cloud Run Job invocation. Phase 1 extracted capture/encode stages; Phase 2 added the determinism-hardening flags those stages need. This PR is the first caller that flips them to `true` for the capture + encode path (PR #808 was the first caller for `compileForRender`'s flag).

## How
`renderChunk` composes the pipeline:

- Reads `plan.json` and `meta/{composition,encoder,chunks}.json`. Out-of-range `chunkIndex`, missing artifacts, or `browserGpuMode !== "software"` trip a typed `RenderChunkValidationError` with `code === PLAN_HASH_MISMATCH` / `BROWSER_GPU_NOT_SOFTWARE`.
- `readFfmpegVersion()` matches against `plan.ffmpegVersion` — any drift trips a non-retryable `FFMPEG_VERSION_MISMATCH` per §9.3.
- `applyRuntimeEnvSnapshot(encoder.runtimeEnv)` BEFORE the file server is created — `RENDER_MODE_SCRIPT` bakes those env vars into served HTML at module load.
- Serves `<planDir>/compiled/` with `buildVirtualTimeShim({ seedRandomFromFrame: true })` (PR 2.4 surface).
- `createCaptureSession` with `lockWarmupTicks: true` (PR 2.3 surface).
- `assertSwiftShader(session.page, readWebGlVendorInfoFromCanvas)` BEFORE `initializeSession`. The default `assertSwiftShader` reader navigates to `chrome://gpu`, which `chrome-headless-shell` serves as an empty document on multiple builds we've tested; this PR threads a canvas + `WEBGL_debug_renderer_info` reader through the helper's existing `readInfo` override so the assertion works on both regular Chrome and `chrome-headless-shell` (see the sketch after this list). No fork of the Phase 2 helper.
- `discardWarmupCapture(session, startFrame, startTime)` once before the chunk's first real frame (PR 2.6 surface).
- `runCaptureStage` with `frameRange: { startFrame, endFrame }` and `workerCount: 1`. The new optional `frameRange` field on `CaptureStageInput` extends the sequential capture branch: per-frame TIMES use absolute composition frame indices (so virtual time matches an in-process render at that frame), file NAMES normalize to zero (so the encoder reads them without an `-start_number` override). In-process callers omit `frameRange` and get the existing `[0, totalFrames)` behavior verbatim.
- `runEncodeStage` with `lockGopForChunkConcat: true` + `gopSize: framesInChunk` + `hasAudio: false`. Closed-GOP per §7.1 so concat-copy at assemble time round-trips losslessly. The two new optional fields are pass-through to `EncoderOptions`; in-process callers omit them.
- Returns a `ChunkResult`.
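A minimal sketch of what a canvas-based vendor/renderer probe looks like; the wrapper and names here are illustrative, the real reader is the `readWebGlVendorInfoFromCanvas` override cited above:

```ts
// Runs inside the page (e.g., via page.evaluate): create an offscreen canvas,
// grab a WebGL context, and prefer WEBGL_debug_renderer_info when available.
// This avoids chrome://gpu, which chrome-headless-shell serves as an empty page.
function readWebGlVendorInfo(): { vendor: string; renderer: string } | null {
  const canvas = document.createElement("canvas");
  const gl =
    canvas.getContext("webgl") ??
    (canvas.getContext("experimental-webgl") as WebGLRenderingContext | null);
  if (!gl) return null;
  const debugInfo = gl.getExtension("WEBGL_debug_renderer_info");
  if (debugInfo) {
    return {
      vendor: String(gl.getParameter(debugInfo.UNMASKED_VENDOR_WEBGL)),
      renderer: String(gl.getParameter(debugInfo.UNMASKED_RENDERER_WEBGL)),
    };
  }
  // Fallback for stripped-down builds without the debug extension.
  return {
    vendor: String(gl.getParameter(gl.VENDOR)),
    renderer: String(gl.getParameter(gl.RENDERER)),
  };
}

// The SwiftShader assertion would then check renderer.includes("SwiftShader") (sketch).
```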
### Changes outside the new module

- `render/stages/captureStage.ts` — `frameRange?: { startFrame, endFrame }` on `CaptureStageInput`. Sequential branch only; rejects `workerCount > 1 && frameRange`. `frameRange === undefined` → identical behavior to `[0, totalFrames)`.
- `render/stages/encodeStage.ts` — `lockGopForChunkConcat?` + `gopSize?` on `EncodeStageInput`, pass-through to `EncoderOptions`.
- `renderOrchestrator.ts`, the `hyperframes render` CLI, and the producer HTTP routes are untouched.

## Test plan
- Unit tests added — `packages/producer/src/services/distributed/renderChunk.test.ts`. 3 cases:
  - `ChunkResult.sha256` matches (byte-identical retry).
  - Out-of-range `chunkIndex` rejected before Chrome init.
  - Missing `plan.json` rejected before Chrome init.
- `bun test packages/producer/src/services/distributed/` — 13 pass (PR 3.1 + 3.2 combined).
- `bun test packages/producer/src/` — 315 pass, 1 fail. The one failure is the pre-existing `writeCompiledArtifacts — external assets on Windows drive-letter paths` flake that also fails on a clean checkout of `origin/main`.
- `bun run --filter @hyperframes/producer typecheck` — clean.
- `bunx oxlint` + `bunx oxfmt --check` — clean on changed files.
- `executeRenderJob` is unchanged here; PSNR baselines should hold.

## Known limitation surfaced during testing
`chrome-headless-shell` on some dev/CI hosts can't initialize SwiftShader (the GL stack errors out before `BeginFrame` can run). The byte-identical retry test soft-skips on those hosts and the Docker harness (where the chrome-headless-shell build is matched to the planDir's ffmpegVersion) is the source of truth for the determinism contract. To exercise the test locally on a host with a working `chrome-headless-shell`, the test fixture uses `format: "png-sequence"`, which forces screenshot mode and avoids the BeginFrame dependency.

This is PR 2 of a 6-PR Phase 3 stack:
- 3.1 — `services/distributed/plan.ts` (feat(producer): add services/distributed/plan.ts #808)
- 3.2 (this PR) — `services/distributed/renderChunk.ts`
- 3.3 — `services/distributed/assemble.ts`
- 3.4 — `planDir` size cap (`PLAN_TOO_LARGE`)
- 3.5 — distributed format banlist (webm + HDR mp4)
- 3.6 — public exports + `@hyperframes/producer/distributed` subpath

🤖 Generated with Claude Code