feat(producer): add services/distributed/assemble.ts #813
miguel-heygen
left a comment
Review: assemble.ts — Distributed Chunk Assembly
Clean implementation. Successfully avoids the re-render bottleneck — audio arrives pre-rendered via `audioPath`, `padOrTrim` normalizes to video length, mux stream-copies video + re-encodes audio at 192k AAC. Concat demuxer usage is textbook.
Test coverage is comprehensive: mp4 concat + faststart byte-level verification, audio mux with duration assertion (50ms tolerance for AAC frame quantization), png-sequence merge with global renumbering, and two rejection cases.
Notes
- **Silent audio skip (~line 104-106):** When `audioPath` is non-null but the file doesn't exist, it silently skips audio. This could hide an upstream audio render failure. Consider at minimum logging a warning.
- **Double faststart for mp4 with audio:** `muxVideoWithAudio` adds `+faststart` internally, then `assemble.ts:186` calls `applyFaststart` again. Idempotent but wastes one ffmpeg invocation. Consider skipping `applyFaststart` when audio was muxed.
- **`console.warn` vs structured log:** `mergePngFrameDirs` frame count mismatch uses `console.warn` instead of the structured `log` parameter used elsewhere in the module.
All minor — none blocking. LGTM.
vanceingalls
left a comment
Phase 3 Activity C — assemble stitches chunks into the final deliverable. Concat-copy on mp4/mov is the right call given the chunk encoder's `lockGopForChunkConcat: true` + `gopSize === framesInChunk` from #809, and the png-sequence directory merge with global re-numbering is clean. A few real concerns around audio handling and recoverability.
Strengths
- `assemble.ts:500` — `path.replace(/'/g, "'\\''")` is the correct ffmpeg-concat-demuxer single-quote escape. Spelled out so it survives the next refactor.
- `assemble.ts:526-543` — running `padOrTrimAudioToVideoFrameCount` BEFORE mux is the right ordering. Audio drift at the end of long renders is exactly the failure mode the helper exists to catch, and feeding the concat output's actual frame count (rather than `plan.totalFrames`) keeps the normalization grounded in what was actually rendered.
- `assemble.ts:599-647` — the png-sequence merge does global re-numbering across chunk boundaries (`globalIdx + 1`) so consumers see one continuous sequence rather than per-chunk numbering. Matches the in-process renderer's convention.
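The escape called out above can be checked in isolation. A minimal sketch of how it composes into a concat-demuxer list file (`buildConcatList` is an illustrative name, not a module export):

```typescript
// Illustrative helper (not in assemble.ts): build the list file that
// `ffmpeg -f concat -safe 0 -c copy` consumes. Each line has the form
//   file 'path'
// and a literal single quote inside the path must be written as '\''
// (close quote, escaped quote, reopen quote), which is exactly what
// path.replace(/'/g, "'\\''") produces.
function buildConcatList(chunkPaths: string[]): string {
  return chunkPaths
    .map((p) => `file '${p.replace(/'/g, "'\\''")}'`)
    .join("\n");
}
```

A path like `/tmp/it's.mp4` comes out as `file '/tmp/it'\''s.mp4'`, which the demuxer reads back as the original path.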
Findings
important — assemble accepts a null `audioPath` even when `plan.hasAudio === true` (`assemble.ts:447-543`)

The caller passes `audioPath: string | null`. If the composition has audio (`plan.hasAudio === true`) but the caller passes `null`, the function silently skips the mux and ships a video with no audio track — no warning, no error, no diagnostic. The mismatch is exactly the kind of contract violation assemble should catch at its trust boundary:

```typescript
if (plan.hasAudio && audioPath === null) {
  throw new Error("[assemble] planDir has audio but caller passed audioPath=null");
}
if (!plan.hasAudio && audioPath !== null) {
  // Warn at minimum — caller is passing audio for a no-audio plan,
  // which suggests a stale planDir / mismatched caller config.
}
```

Without this, a workflow adapter bug (e.g., forgetting to download `audio.aac` from S3 before calling assemble) ships a silently-broken deliverable. The five-test coverage in `assemble.test.ts` doesn't exercise this case.
important — `mergePngFrameDirs` warns on frame-count mismatch but proceeds (`assemble.ts:629-639`)

```typescript
if (globalIdx !== totalFrames) {
  console.warn(
    `[assemble] png-sequence frame count mismatch: merged ${globalIdx} frames vs ` +
      `plan.totalFrames=${totalFrames}. Using on-disk count.`,
  );
}
```

The comment says "consumers should rely on the on-disk count" and the result struct returns `framesEncoded: globalIdx` (the on-disk count). So the structured return is correct, but the `console.warn` is unstructured — it won't show up in workflow adapter logs, won't surface to the user, won't trip the workflow's failure path.
The justification ("can differ by ±1 across hosts in edge cases") is fair for small drifts, but ±1 is also exactly the value that masks "one of the chunks silently produced N-1 frames because the encoder choked." The drift threshold matters. Two cleaner shapes:
- Tight: throw on any mismatch. The byte-identical-retry contract from #809 makes this safe — if a chunk produces the wrong frame count it's deterministic, so the mismatch is a real bug worth surfacing.
- Tolerant: allow `Math.abs(globalIdx - totalFrames) <= 1`, throw otherwise. Document the ±1 tolerance and which paths produce it.
I'd take the first (throw on any mismatch) and let real edge cases surface via test failures so the cause can be diagnosed instead of buried in a warning.
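For reference, the tolerant shape is just a threshold guard. A sketch with assumed names (this helper does not exist in the module):

```typescript
// Illustrative guard for the "tolerant" option: accept up to ±1 frame of
// cross-host drift, throw on anything larger. `onDiskFrames` stands in for
// the merged count (globalIdx), `plannedFrames` for plan.totalFrames.
function assertFrameCountWithinTolerance(
  onDiskFrames: number,
  plannedFrames: number,
  tolerance = 1,
): void {
  if (Math.abs(onDiskFrames - plannedFrames) > tolerance) {
    throw new Error(
      `[assemble] png-sequence frame count mismatch: merged ${onDiskFrames} ` +
        `frames vs plan.totalFrames=${plannedFrames} (tolerance=${tolerance})`,
    );
  }
}
```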
important — the mp4/mov `finally` block on line 568-577 doesn't clean up `outputPath` on partial failure

The `finally` removes `workDir` (good), but if the faststart step fails after the concat + mux succeeded, `outputPath` is left as either a partial file (if `applyFaststart` was mid-write) or the mux output (if faststart hadn't started). Callers reading `outputPath` after a failed `assemble()` get garbage they think is valid. Either:
- Atomic write: faststart to a `.tmp` path, `rename` to `outputPath` on success.
- Clean up `outputPath` in the same `finally` when the outer try throws.
The "callers handle this" answer is fine if documented; it's not currently documented.
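The atomic-write option is a few lines. A sketch with `applyFaststart` injected as a parameter, since this is not the module's real signature:

```typescript
import { renameSync, rmSync } from "node:fs";

// Illustrative shape of the atomic-write option: run faststart into a
// .tmp sibling, then rename into place only on success. rename within a
// single filesystem is atomic, so a failed faststart never leaves a
// partial or stale file at outputPath.
async function faststartAtomic(
  applyFaststart: (src: string, dst: string) => Promise<void>,
  muxOutputPath: string,
  outputPath: string,
): Promise<void> {
  const tmpPath = `${outputPath}.tmp`;
  try {
    await applyFaststart(muxOutputPath, tmpPath);
    renameSync(tmpPath, outputPath); // atomic publish: readers never see a partial file
  } catch (err) {
    rmSync(tmpPath, { force: true }); // best-effort tmp cleanup; outputPath was never touched
    throw err;
  }
}
```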
nit — `assemble.ts:566` — `applyFaststart` is documented as "no-op for .mov" but the chained call still writes through
Comment at line 562: "applyFaststart is a no-op for .mov (it copies the input to output)". A copy is not free — it's a full read of `muxOutputPath` and a full write of `outputPath`. For a 4K SDR 1-minute mov at ProRes 4444 that's ~10 GB. If there's no functional reason to copy on .mov, skip the call and `renameSync(muxOutputPath, outputPath)` instead — same outcome at no cost.
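A sketch of the suggested skip, again with `applyFaststart` injected rather than the module's actual call shape:

```typescript
import { renameSync } from "node:fs";
import { extname } from "node:path";

// Illustrative finalization step: for .mov, where faststart is documented
// as a copy-through no-op, rename the mux output into place instead of
// re-reading and re-writing it. A same-device rename moves no bytes.
async function finalizeContainer(
  applyFaststart: (src: string, dst: string) => Promise<void>,
  muxOutputPath: string,
  outputPath: string,
): Promise<void> {
  if (extname(outputPath).toLowerCase() === ".mov") {
    renameSync(muxOutputPath, outputPath); // no byte copy on the same filesystem
    return;
  }
  await applyFaststart(muxOutputPath, outputPath);
}
```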
nit — assemble doesn't verify `chunkPaths` point at outputs from THIS planDir

`assemble.ts:476-480` checks paths exist but doesn't check the perf sidecars (which carry `planHash`) match `plan.planHash`. A caller pointing at chunks from a different planDir produces undefined output. Read the perf sidecars and assert `planHash` matches; cheap (it's already on disk per #809).
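A sketch of the check. The sidecar naming convention (`<chunkPath>.perf.json`) and its `planHash` field are assumptions inferred from the #809 description, not verified against the module:

```typescript
import { readFileSync } from "node:fs";

// Illustrative trust-boundary check: assert every chunk's perf sidecar
// was produced from the same planHash as the plan being assembled.
// Sidecar path and field name are assumed, not the module's real contract.
function assertChunksBelongToPlan(chunkPaths: string[], expectedPlanHash: string): void {
  for (const chunkPath of chunkPaths) {
    const sidecar = JSON.parse(readFileSync(`${chunkPath}.perf.json`, "utf8")) as {
      planHash?: string;
    };
    if (sidecar.planHash !== expectedPlanHash) {
      throw new Error(
        `[assemble] ${chunkPath} was rendered from planHash=${sidecar.planHash}; ` +
          `expected ${expectedPlanHash}`,
      );
    }
  }
}
```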
Verdict: APPROVE
Reasoning: The concat-copy + mux + faststart sequence is correct and well-grounded in the chunk encoder's GOP discipline. The findings above are real but recoverable as follow-ups — none corrupt the happy-path output, they just leave silent-failure surfaces around it. Worth landing now and fixing the trust-boundary gaps in a focused follow-up.
— Vai
vanceingalls
left a comment
Re-review
PR rebased onto the new #808/#809 bases (which shifted the shared.ts helpers and added the env-restore + planHash-validation contracts to renderChunk). The assemble.ts content itself didn't change between SHAs 5f0239a → 24f0ffea; the diff is identical to what I previously approved.
Net-new findings
Note (catch from a fresh read — not blocking): `assemble.ts:491` — ``const workDir = `${outputPath}.assemble-work`;`` has the same race I flagged on `renderChunk` (concurrent invocations on the same `outputPath` would `rmSync` each other's work). Assemble is controller-side so the realistic risk is lower than for `renderChunk` workers, but the same suffix pattern (pid + `randomBytes`) would be cheap to apply if the controller ever fan-outs more than once. Nit, not gating — flagging for parity with the renderChunk fix.
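For parity, the suffix pattern from the renderChunk fix would look roughly like this (helper name is illustrative, not a module export):

```typescript
import { randomBytes } from "node:crypto";

// Illustrative collision-resistant work-dir suffix: pid + random bytes
// keep two concurrent assembles on the same outputPath from rmSync-ing
// each other's work tree.
function uniqueWorkDir(outputPath: string): string {
  return `${outputPath}.assemble-work-${process.pid}-${randomBytes(4).toString("hex")}`;
}
```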
Verdict: APPROVE
Reasoning: No content change since prior approval; prior findings still stand as non-blocking follow-ups. Catch on the workDir suffix is a nit, easily a follow-up. CI required checks green.
— Vai
miguel-heygen
left a comment
Re-stamping after rebase. LGTM.
Merge activity
## What

Phase 3 of the distributed rendering plan: §6.4 / §9.3 size cap. Extends `plan()` to measure the produced planDir's total byte size after freeze and throw a typed non-retryable `PlanTooLargeError` (`code === "PLAN_TOO_LARGE"`) when the planDir exceeds 2 GB.

## Why

Distributed chunk workers ship the entire planDir to whatever ephemeral storage they're running on — `/tmp` on AWS Lambda (10 GB), the container filesystem on Cloud Run Jobs, etc. A planDir that doesn't fit can't be rendered. v1.5 lifts this cap via per-chunk video-frame slicing (§12); for now v1 fails fast at plan time so adapters don't waste a fan-out attempt that's guaranteed to OOM. The 2 GB ceiling specifically targets Lambda's 10 GB `/tmp`: planDir + per-chunk captured frames + ffmpeg's working set all share that budget, and 2 GB leaves ~8 GB for capture/encode at 4K SDR.

## How

- New exports in `services/distributed/plan.ts`:
  - `PLAN_DIR_SIZE_LIMIT_BYTES` — the 2 GB constant.
  - `PLAN_TOO_LARGE` — the non-retryable error code (matches §9.3).
  - `PlanTooLargeError` — typed error class carrying `code`, `sizeBytes`, `limitBytes`, and a message that points adopters at the v1.5 slicing roadmap + the in-process renderer escape hatch.
  - `measurePlanDirBytes(planDir)` — recursive on-disk size walker. Symlinks skipped intentionally.
- `DistributedRenderConfig.planDirSizeLimitBytes?: number` — optional override. Defaults to `PLAN_DIR_SIZE_LIMIT_BYTES`. Tests pass a tiny cap (1024 bytes) to exercise the throw path without filling 2 GB of /tmp.
- The check runs in `plan()` AFTER the temp work tree is removed (so `.plan-work/` doesn't double-count) but BEFORE the function returns — adapters that catch the error never see a `PlanResult`.

### What did NOT change

`executeRenderJob`, the in-process orchestrator, the `hyperframes render` CLI, producer HTTP routes — all unchanged. Only `plan()` (which is itself opt-in) enforces the cap.

## Test plan

- [x] Unit tests added — `packages/producer/src/services/distributed/planSizeCap.test.ts`. 7 cases:
  - `measurePlanDirBytes` returns 0 for an empty dir, sums recursively, and gracefully ignores broken entries.
  - `PLAN_DIR_SIZE_LIMIT_BYTES` is `2 * 1024 * 1024 * 1024` (§6.4 pin).
  - `PlanTooLargeError` carries the `PLAN_TOO_LARGE` code + `sizeBytes` + `limitBytes` and mentions the v1.5 escape hatch.
  - `plan()` throws `PlanTooLargeError` when configured with a 1024-byte ceiling.
  - `plan()` succeeds when the default 2 GB ceiling is well above the produced planDir.
- [x] `bun test packages/producer/src/services/distributed/` — 25 pass (PRs 3.1 + 3.2 + 3.3 + 3.4).
- [x] `bun run --filter @hyperframes/producer typecheck` — clean.
- [x] `bunx oxlint` + `bunx oxfmt --check` — clean on changed files.
- [ ] Producer Docker regression harness — pending CI. `executeRenderJob` is unchanged; PSNR baselines should hold.

This is PR 4 of a 6-PR Phase 3 stack:

- 3.1 — `services/distributed/plan.ts` (#808)
- 3.2 — `services/distributed/renderChunk.ts` (#809)
- 3.3 — `services/distributed/assemble.ts` (#813)
- **3.4 (this PR)** — `planDir` size cap (`PLAN_TOO_LARGE`)
- 3.5 — distributed format banlist (webm + HDR mp4)
- 3.6 — public exports + `@hyperframes/producer/distributed` subpath

🤖 Generated with [Claude Code](https://claude.com/claude-code)
…4) (#815)

## What

Phase 3 of the distributed rendering plan: §11 PR 3.5 format banlist. Extends `plan()` to refuse two v1-unsupported formats up front with a typed non-retryable `FormatNotSupportedInDistributedError` (`code === "FORMAT_NOT_SUPPORTED_IN_DISTRIBUTED"`).

## Why

Both webm and HDR mp4 are documented as deferred to v1.5 (§7.2 + §12), but until this PR the only signal at the runtime layer is the in-process pipeline silently producing wrong output (chunk concat-copy doesn't round-trip VP9; HDR signaling gets stripped at the chunk boundary). Failing fast at `plan()` time keeps adopters from spending fan-out compute on a render that can't succeed and gives them a typed error code their workflow adapter can route on.

## How

- New exports in `services/distributed/plan.ts`:
  - `FORMAT_NOT_SUPPORTED_IN_DISTRIBUTED` — non-retryable error code matching §11's wording.
  - `FormatNotSupportedInDistributedError` — typed error class with `code`, `format`, and `reason` fields. Message names the rejected format and tells adopters to fall back to the in-process renderer (`executeRenderJob`), which has full format support.
  - `rejectUnsupportedDistributedFormat(config)` — pure helper exported separately so adapters can run the same gate at their input layer (Step Functions input validation, Temporal workflow start) before the activity even runs.
- `plan()` calls `rejectUnsupportedDistributedFormat(config)` as the first line of the function — BEFORE `mkdirSync(planDir)` so a banned input never produces a partial planDir.
- Replaced the previous ad-hoc `if (hdrMode === "force-hdr") throw new Error(...)` with the typed error class.

### What did NOT change

`executeRenderJob`, the in-process orchestrator, the `hyperframes render` CLI, producer HTTP routes — all unchanged. The in-process renderer continues to accept webm + HDR (its existing functionality).

## Test plan

- [x] Unit tests added — `packages/producer/src/services/distributed/planFormatBanlist.test.ts`. 5 cases:
  - `rejectUnsupportedDistributedFormat` accepts the v1-supported formats (mp4, mov, png-sequence) with both `auto` and `force-sdr` hdrMode.
  - Rejects webm — error has `code === FORMAT_NOT_SUPPORTED_IN_DISTRIBUTED`, `format === "webm"`, message mentions in-process renderer.
  - Rejects HDR mp4 (`hdrMode === "force-hdr"`) — error has `format === "mp4-hdr"`, message mentions HDR.
  - End-to-end via `plan()`: webm throws with no planDir leaking to disk.
  - End-to-end via `plan()`: HDR mp4 throws with no planDir leaking to disk.
- [x] `bun test packages/producer/src/services/distributed/` — 30 pass.
- [x] `bun run --filter @hyperframes/producer typecheck` — clean.
- [x] `bunx oxlint` + `bunx oxfmt --check` — clean on changed files.
- [ ] Producer Docker regression harness — pending CI. `executeRenderJob` is unchanged; PSNR baselines should hold.

This is PR 5 of a 6-PR Phase 3 stack:

- 3.1 — `services/distributed/plan.ts` (#808)
- 3.2 — `services/distributed/renderChunk.ts` (#809)
- 3.3 — `services/distributed/assemble.ts` (#813)
- 3.4 — `planDir` size cap (`PLAN_TOO_LARGE`) (#814)
- **3.5 (this PR)** — distributed format banlist (webm + HDR mp4)
- 3.6 — public exports + `@hyperframes/producer/distributed` subpath

🤖 Generated with [Claude Code](https://claude.com/claude-code)
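A rough sketch of the gate this commit describes. The config field names and union types are assumptions from the PR text; the real signatures live in `plan.ts`:

```typescript
// Sketch only: mirrors the documented error shape and rejection rules,
// not the shipped code. Config field types are assumed.
const FORMAT_NOT_SUPPORTED_IN_DISTRIBUTED = "FORMAT_NOT_SUPPORTED_IN_DISTRIBUTED";

class FormatNotSupportedInDistributedError extends Error {
  readonly code = FORMAT_NOT_SUPPORTED_IN_DISTRIBUTED;
  constructor(
    readonly format: string,
    readonly reason: string,
  ) {
    super(
      `[plan] format "${format}" is not supported by the distributed pipeline ` +
        `(${reason}); fall back to the in-process renderer (executeRenderJob)`,
    );
  }
}

function rejectUnsupportedDistributedFormat(config: {
  format: "mp4" | "mov" | "png-sequence" | "webm";
  hdrMode?: "auto" | "force-sdr" | "force-hdr";
}): void {
  if (config.format === "webm") {
    throw new FormatNotSupportedInDistributedError(
      "webm",
      "chunk concat-copy does not round-trip VP9",
    );
  }
  if (config.hdrMode === "force-hdr") {
    throw new FormatNotSupportedInDistributedError(
      "mp4-hdr",
      "HDR signaling is stripped at the chunk boundary",
    );
  }
}
```

Exporting the gate as a pure function (as the commit describes) is what lets a Step Functions or Temporal adapter reject the input before any activity spins up.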

What

Phase 3 of the distributed rendering plan: the third public primitive. Adds `assemble(planDir, chunkPaths, audioPath, outputPath)` and its supporting types at `packages/producer/src/services/distributed/assemble.ts`. See `DISTRIBUTED-RENDERING-PLAN.md` §11 Phase 3.

Why

`plan()` (#808) and `renderChunk()` (#809) produce the planDir and the per-chunk outputs respectively. `assemble()` is what every distributed fan-out workflow runs last: it stitches the chunks into the final deliverable. Without it, the planDir → chunks chain stops at a list of files; nothing produces the user-facing mp4/mov/png-sequence.

How

`assemble()` branches on the planDir's encoder format:

- mp4 / mov: ffmpeg `-f concat -c copy` over the ordered chunk paths. Each chunk's first frame is an IDR keyframe (PR 3.2 set `lockGopForChunkConcat: true`), so concat-copy round-trips losslessly. The concatenated output is then run through:
  - `padOrTrimAudioToVideoFrameCount` (PR 2.7 surface) when `audioPath` is non-null, so audio length is exactly `frameCount / fps` rather than the audio mixer's original-duration output.
  - `muxVideoWithAudio` (same helper the in-process renderer's `assembleStage` uses).
  - `applyFaststart` so the `moov` atom moves to the file's start.

  When no audio is present, the concat output skips mux and goes straight to `applyFaststart`.
- png-sequence: chunks are directories of `frame_NNNNNN.png` files numbered locally per chunk. `assemble()` merges them with a continuous global index so chunk 0's `frame_000000.png` lands at `frame_000001.png` in the output, chunk 1's first frame becomes `frame_(N+1)`, etc. When `audioPath` is non-null we copy it alongside as `audio.aac` so callers who need to re-mux later have it.

Validation

Both branches assert `chunkPaths.length === chunks.length` (the value read from `meta/chunks.json`) and that each chunk path exists. A missing or mismatched manifest trips a typed error before any ffmpeg invocation.

What did NOT change

No engine helpers, no Phase 1 stages, no in-process orchestrator. `assemble()` reuses `muxVideoWithAudio` / `applyFaststart` / `runFfmpeg` / `padOrTrimAudioToVideoFrameCount` exactly as they exist — the in-process `runAssembleStage` is intentionally not called because it operates on a `RenderJob` and emits `updateJobStatus` payloads, neither of which the distributed activity has.

Test plan

`packages/producer/src/services/distributed/assemble.test.ts`. 5 cases:

- … frame-count-derived duration when `audio.aac` is present (ffprobe asserts audio duration within 50ms of `totalFrames / fps`).
- … `frame_000001..frame_000007`).
- … `chunkPaths.length` vs `chunks.json.length`.
- … `plan.json`.

Also:

- `bun test packages/producer/src/services/distributed/` — 18 pass (PRs 3.1 + 3.2 + 3.3).
- `bun run --filter @hyperframes/producer typecheck` — clean.
- `bunx oxlint` + `bunx oxfmt --check` — clean on changed files.
- `executeRenderJob` is unchanged; PSNR baselines should hold.

The mp4 fixture pre-renders test inputs via raw ffmpeg (`testsrc` filter + closed-GOP libx264) rather than going through the Chrome capture pipeline. This isolates concat-copy + mux + faststart from the renderChunk path that PRs 3.1/3.2 already cover, and avoids the chrome-headless-shell smoke-test gating that PR 3.2 needed.

This is PR 3 of a 6-PR Phase 3 stack:

- `services/distributed/plan.ts` (feat(producer): add services/distributed/plan.ts #808)
- `services/distributed/renderChunk.ts` (feat(producer): add services/distributed/renderChunk.ts #809)
- `services/distributed/assemble.ts`
- `planDir` size cap (`PLAN_TOO_LARGE`)
- `@hyperframes/producer/distributed` subpath

🤖 Generated with Claude Code
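The png-sequence re-numbering convention described above reduces to a small index mapping. A sketch (helper name is mine, not a module export):

```typescript
// Illustrative mapping for the png-sequence merge: per-chunk 0-based
// frame_NNNNNN.png names become one continuous 1-based global sequence,
// so chunk 0's frame_000000.png lands at frame_000001.png, and a chunk
// starting at global index N continues from frame_(N+1).
function globalFrameName(chunkFirstGlobalIdx: number, localIdx: number): string {
  const globalIdx = chunkFirstGlobalIdx + localIdx;
  return `frame_${String(globalIdx + 1).padStart(6, "0")}.png`;
}
```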