Skip to content

feat(producer): add services/distributed/assemble.ts#813

Merged
jrusso1020 merged 1 commit into
mainfrom
05-13-feat_producer_add_services_distributed_assemble.ts
May 14, 2026
Merged

feat(producer): add services/distributed/assemble.ts#813
jrusso1020 merged 1 commit into
mainfrom
05-13-feat_producer_add_services_distributed_assemble.ts

Conversation

@jrusso1020
Copy link
Copy Markdown
Collaborator

@jrusso1020 jrusso1020 commented May 13, 2026

What

Phase 3 of the distributed rendering plan: the third public primitive. Adds assemble(planDir, chunkPaths, audioPath, outputPath) and its supporting types at packages/producer/src/services/distributed/assemble.ts. See DISTRIBUTED-RENDERING-PLAN.md §11 Phase 3.

Why

plan() (#808) and renderChunk() (#809) produce the planDir and the per-chunk outputs respectively. assemble() is what every distributed fan-out workflow runs last: it stitches the chunks into the final deliverable. Without it, the planDir → chunks chain stops at a list of files; nothing produces the user-facing mp4/mov/png-sequence.

How

assemble() branches on the planDir's encoder format:

mp4 / mov: ffmpeg -f concat -c copy over the ordered chunk paths. Each chunk's first frame is an IDR keyframe (PR 3.2 set lockGopForChunkConcat: true), so concat-copy round-trips losslessly. The concatenated output is then:

  1. Passed through padOrTrimAudioToVideoFrameCount (PR 2.7 surface) when audioPath is non-null, so audio length is exactly frameCount / fps rather than the audio mixer's original-duration output.
  2. Muxed with the normalized audio via the engine's muxVideoWithAudio (same helper the in-process renderer's assembleStage uses).
  3. Passed through applyFaststart so the moov atom moves to the file's start.

When no audio is present, the concat output skips mux and goes straight to applyFaststart.

png-sequence: chunks are directories of frame_NNNNNN.png files numbered locally per chunk. assemble() merges them with a continuous global index so chunk 0's frame_000000.png lands at frame_000001.png in the output, chunk 1's first frame becomes frame_(N+1), etc. When audioPath is non-null we copy it alongside as audio.aac so callers who need to re-mux later have it.

Validation

Both branches assert chunkPaths.length === chunks.length (the value read from meta/chunks.json) and that each chunk path exists. A missing or mismatched manifest trips a typed error before any ffmpeg invocation.

What did NOT change

No engine helpers, no Phase 1 stages, no in-process orchestrator. assemble() reuses muxVideoWithAudio / applyFaststart / runFfmpeg / padOrTrimAudioToVideoFrameCount exactly as they exist — the in-process runAssembleStage is intentionally not called because it operates on a RenderJob and emits updateJobStatus payloads, neither of which the distributed activity has.

Test plan

  • Unit tests added — packages/producer/src/services/distributed/assemble.test.ts. 5 cases:
    • Concat-copies two mp4 chunks and applies faststart (ffprobe asserts codec, frame count, atom order).
    • Muxes audio with frame-count-derived duration when audio.aac is present (ffprobe asserts audio duration within 50ms of totalFrames / fps).
    • Merges png-sequence chunk directories with continuous global numbering (asserts filenames frame_000001..frame_000007).
    • Rejects mismatched chunkPaths.length vs chunks.json.length.
    • Rejects a planDir missing plan.json.
  • bun test packages/producer/src/services/distributed/ — 18 pass (PRs 3.1 + 3.2 + 3.3).
  • bun run --filter @hyperframes/producer typecheck — clean.
  • bunx oxlint + bunx oxfmt --check — clean on changed files.
  • Producer Docker regression harness — pending CI. executeRenderJob is unchanged; PSNR baselines should hold.

The mp4 fixture pre-renders test inputs via raw ffmpeg (testsrc filter + closed-GOP libx264) rather than going through the Chrome capture pipeline. This isolates concat-copy + mux + faststart from the renderChunk path that PRs 3.1/3.2 already cover, and avoids the chrome-headless-shell smoke-test gating that PR 3.2 needed.

This is PR 3 of a 6-PR Phase 3 stack:

🤖 Generated with Claude Code

miguel-heygen
miguel-heygen previously approved these changes May 13, 2026
Copy link
Copy Markdown
Collaborator

@miguel-heygen miguel-heygen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: assemble.ts — Distributed Chunk Assembly

Clean implementation. Successfully avoids the re-render bottleneck — audio arrives pre-rendered via audioPath, padOrTrim normalizes to video length, mux stream-copies video + re-encodes audio at 192k AAC. Concat demuxer usage is textbook.

Test coverage is comprehensive: mp4 concat + faststart byte-level verification, audio mux with duration assertion (50ms tolerance for AAC frame quantization), png-sequence merge with global renumbering, and two rejection cases.

Notes

  1. Silent audio skip (~line 104-106): When audioPath is non-null but the file doesn't exist, it silently skips audio. This could hide an upstream audio render failure. Consider at minimum logging a warning.

  2. Double faststart for mp4 with audio: muxVideoWithAudio adds +faststart internally, then assemble.ts:186 calls applyFaststart again. Idempotent but wastes one ffmpeg invocation. Consider skipping applyFaststart when audio was muxed.

  3. console.warn vs structured log: mergePngFrameDirs frame count mismatch uses console.warn instead of the structured log parameter used elsewhere in the module.

All minor — none blocking. LGTM.

Copy link
Copy Markdown
Collaborator

@vanceingalls vanceingalls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Phase 3 Activity C — assemble stitches chunks into the final deliverable. Concat-copy on mp4/mov is the right call given the chunk encoder's lockGopForChunkConcat: true + gopSize === framesInChunk from #809, and the png-sequence directory merge with global re-numbering is clean. A few real concerns around audio handling and recoverability.

Strengths

  • assemble.ts:500path.replace(/'/g, "'\\''") is the correct ffmpeg-concat-demuxer single-quote escape. Spelled out so it survives the next refactor.
  • assemble.ts:526-543 — running padOrTrimAudioToVideoFrameCount BEFORE mux is the right ordering. Audio drift at the end of long renders is exactly the failure mode the helper exists to catch, and feeding the concat output's actual frame count (rather than plan.totalFrames) keeps the normalization grounded in what was actually rendered.
  • assemble.ts:599-647 — the png-sequence merge does global re-numbering across chunk boundaries (globalIdx + 1) so consumers see one continuous sequence rather than per-chunk numbering. Matches the in-process renderer's convention.

Findings

important — assemble accepts a null audioPath even when plan.hasAudio === true (assemble.ts:447-543)

The caller passes audioPath: string | null. If the composition has audio (plan.hasAudio === true) but the caller passes null, the function silently skips the mux and ships a video with no audio track — no warning, no error, no diagnostic. The mismatch is exactly the kind of contract violation assemble should catch at its trust boundary:

if (plan.hasAudio && audioPath === null) {
  throw new Error("[assemble] planDir has audio but caller passed audioPath=null");
}
if (!plan.hasAudio && audioPath !== null) {
  // Warn at minimum — caller is passing audio for a no-audio plan,
  // which suggests a stale planDir / mismatched caller config.
}

Without this, a workflow adapter bug (e.g., forgetting to download audio.aac from S3 before calling assemble) ships a silently-broken deliverable. The five-test coverage in assemble.test.ts doesn't exercise this case.

important — mergePngFrameDirs warns on frame-count mismatch but proceeds (assemble.ts:629-639)

if (globalIdx !== totalFrames) {
  console.warn(
    `[assemble] png-sequence frame count mismatch: merged ${globalIdx} frames vs ` +
      `plan.totalFrames=${totalFrames}. Using on-disk count.`,
  );
}

The comment says "consumers should rely on the on-disk count" but the result struct returns framesEncoded: globalIdx (the on-disk count). So the structured return is correct but the console.warn is unstructured — it won't show up in workflow adapter logs, won't surface to the user, won't trip the workflow's failure path.

The justification ("can differ by ±1 across hosts in edge cases") is fair for small drifts, but ±1 is also exactly the value that masks "one of the chunks silently produced N-1 frames because the encoder choked." The drift threshold matters. Two cleaner shapes:

  1. Tight: throw on any mismatch. The byte-identical-retry contract from #809 makes this safe — if a chunk produces the wrong frame count it's deterministic, so the mismatch is a real bug worth surfacing.
  2. Tolerant: allow Math.abs(globalIdx - totalFrames) <= 1, throw otherwise. Document the ±1 tolerance and which paths produce it.

I'd take (1) and let real edge cases surface via test failures so the cause can be diagnosed instead of buried in a warning.

important — the mp4/mov finally block on line 568-577 doesn't clean up outputPath on partial failure

The finally removes workDir (good), but if the faststart step fails after the concat + mux succeeded, outputPath is left as either a partial file (if applyFaststart was mid-write) or the mux output (if faststart hadn't started). Callers reading outputPath after a failed assemble() get garbage they think is valid. Either:

  • Atomic write: faststart to a .tmp path, rename to outputPath on success.
  • Cleanup outputPath in the same finally when the outer try throws.

The "callers handle this" answer is fine if documented; it's not currently documented.

nit — assemble.ts:566applyFaststart is documented as "no-op for .mov" but the chained call still writes through

Comment at line 562: "applyFaststart is a no-op for .mov (it copies the input to output)". A copy is not free — it's a full read of muxOutputPath and a full write of outputPath. For a 4K SDR 1-minute mov at ProRes 4444 that's ~10 GB. If there's no functional reason to copy on .mov, skip the call and renameSync(muxOutputPath, outputPath) instead — same outcome at no cost.

nit — assemble doesn't verify chunkPaths point at outputs from THIS planDir

assemble.ts:476-480 checks paths exist but doesn't check the perf sidecars (which carry planHash) match plan.planHash. A caller pointing at chunks from a different planDir produces undefined output. Read the perf sidecars and assert planHash matches; cheap (it's already on disk per #809).

Verdict: APPROVE
Reasoning: The concat-copy + mux + faststart sequence is correct and well-grounded in the chunk encoder's GOP discipline. The findings above are real but recoverable as follow-ups — none corrupt the happy-path output, they just leave silent-failure surfaces around it. Worth landing now and fixing the trust-boundary gaps in a focused follow-up.

— Vai

@jrusso1020 jrusso1020 force-pushed the 05-13-feat_producer_add_services_distributed_assemble.ts branch from 8074417 to 24f0ffe Compare May 14, 2026 00:14
@jrusso1020 jrusso1020 force-pushed the 05-13-feat_producer_add_services_distributed_renderchunk.ts branch from c82179f to 1dbd54a Compare May 14, 2026 00:14
vanceingalls
vanceingalls previously approved these changes May 14, 2026
Copy link
Copy Markdown
Collaborator

@vanceingalls vanceingalls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review

PR rebased onto the new #808/#809 bases (which shifted the shared.ts helpers and added the env-restore + planHash-validation contracts to renderChunk). The assemble.ts content itself didn't change between SHAs 5f0239a24f0ffea; the diff is identical to what I previously approved.

Net-new findings

Note (catch from a fresh read — not blocking): assemble.ts:491const workDir = \${outputPath}.assemble-work`;has the same race I flagged onrenderChunk(concurrent invocations on the sameoutputPathwouldrmSynceach other's work). Assemble is controller-side so the realistic risk is lower than forrenderChunk workers, but the same suffix pattern (pid + randomBytes`) would be cheap to apply if the controller ever fan-outs more than once. Nit, not gating — flagging for parity with the renderChunk fix.

Verdict: APPROVE
Reasoning: No content change since prior approval; prior findings non-existent. Catch on the workDir suffix is a nit, easily a follow-up. CI required checks green.

— Vai

@jrusso1020 jrusso1020 force-pushed the 05-13-feat_producer_add_services_distributed_renderchunk.ts branch 2 times, most recently from 2fe3dde to 9e470d1 Compare May 14, 2026 00:47
@jrusso1020 jrusso1020 force-pushed the 05-13-feat_producer_add_services_distributed_assemble.ts branch 2 times, most recently from d0cd7f5 to 6b8d1d1 Compare May 14, 2026 01:13
@jrusso1020 jrusso1020 force-pushed the 05-13-feat_producer_add_services_distributed_renderchunk.ts branch from 9e470d1 to 1dbd54a Compare May 14, 2026 01:13
@jrusso1020 jrusso1020 changed the base branch from 05-13-feat_producer_add_services_distributed_renderchunk.ts to graphite-base/813 May 14, 2026 01:44
@jrusso1020 jrusso1020 force-pushed the graphite-base/813 branch from 1dbd54a to 602ef87 Compare May 14, 2026 02:09
@jrusso1020 jrusso1020 force-pushed the 05-13-feat_producer_add_services_distributed_assemble.ts branch from 6b8d1d1 to 35239ea Compare May 14, 2026 02:09
@graphite-app graphite-app Bot changed the base branch from graphite-base/813 to main May 14, 2026 02:10
@graphite-app graphite-app Bot dismissed stale reviews from vanceingalls and miguel-heygen May 14, 2026 02:10

The base branch was changed.

@jrusso1020 jrusso1020 force-pushed the 05-13-feat_producer_add_services_distributed_assemble.ts branch from 35239ea to b606106 Compare May 14, 2026 02:10
Copy link
Copy Markdown
Collaborator

@miguel-heygen miguel-heygen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-stamping after rebase. LGTM.

@jrusso1020 jrusso1020 merged commit 3468db6 into main May 14, 2026
39 checks passed
Copy link
Copy Markdown
Collaborator Author

Merge activity

@jrusso1020 jrusso1020 deleted the 05-13-feat_producer_add_services_distributed_assemble.ts branch May 14, 2026 02:38
jrusso1020 added a commit that referenced this pull request May 14, 2026
## What

Phase 3 of the distributed rendering plan: §6.4 / §9.3 size cap. Extends `plan()` to measure the produced planDir's total byte size after freeze and throw a typed non-retryable `PlanTooLargeError` (`code === "PLAN_TOO_LARGE"`) when the planDir exceeds 2 GB.

## Why

Distributed chunk workers ship the entire planDir to whatever ephemeral storage they're running on — `/tmp` on AWS Lambda (10 GB), the container filesystem on Cloud Run Jobs, etc. A planDir that doesn't fit can't be rendered. v1.5 lifts this cap via per-chunk video-frame slicing (§12); for now v1 fails fast at plan time so adapters don't waste a fan-out attempt that's guaranteed to OOM.

The 2 GB ceiling specifically targets Lambda's 10 GB `/tmp`: planDir + per-chunk captured frames + ffmpeg's working set all share that budget, and 2 GB leaves ~8 GB for capture/encode at 4K SDR.

## How

- New exports in `services/distributed/plan.ts`:
  - `PLAN_DIR_SIZE_LIMIT_BYTES` — the 2 GB constant.
  - `PLAN_TOO_LARGE` — the non-retryable error code (matches §9.3).
  - `PlanTooLargeError` — typed error class carrying `code`, `sizeBytes`, `limitBytes`, and a message that points adopters at the v1.5 slicing roadmap + the in-process renderer escape hatch.
  - `measurePlanDirBytes(planDir)` — recursive on-disk size walker. Symlinks skipped intentionally.
- `DistributedRenderConfig.planDirSizeLimitBytes?: number` — optional override. Defaults to `PLAN_DIR_SIZE_LIMIT_BYTES`. Tests pass a tiny cap (1024 bytes) to exercise the throw path without filling 2 GB of /tmp.
- The check runs in `plan()` AFTER the temp work tree is removed (so `.plan-work/` doesn't double-count) but BEFORE the function returns — adapters that catch the error never see a `PlanResult`.

### What did NOT change

`executeRenderJob`, the in-process orchestrator, the `hyperframes render` CLI, producer HTTP routes — all unchanged. Only `plan()` (which is itself opt-in) enforces the cap.

## Test plan

- [x] Unit tests added — `packages/producer/src/services/distributed/planSizeCap.test.ts`. 7 cases:
  - `measurePlanDirBytes` returns 0 for an empty dir, sums recursively, and gracefully ignores broken entries.
  - `PLAN_DIR_SIZE_LIMIT_BYTES` is `2 * 1024 * 1024 * 1024` (§6.4 pin).
  - `PlanTooLargeError` carries the `PLAN_TOO_LARGE` code + `sizeBytes` + `limitBytes` and mentions the v1.5 escape hatch.
  - `plan()` throws `PlanTooLargeError` when configured with a 1024-byte ceiling.
  - `plan()` succeeds when the default 2 GB ceiling is well above the produced planDir.
- [x] `bun test packages/producer/src/services/distributed/` — 25 pass (PRs 3.1 + 3.2 + 3.3 + 3.4).
- [x] `bun run --filter @hyperframes/producer typecheck` — clean.
- [x] `bunx oxlint` + `bunx oxfmt --check` — clean on changed files.
- [ ] Producer Docker regression harness — pending CI. `executeRenderJob` is unchanged; PSNR baselines should hold.

This is PR 4 of a 6-PR Phase 3 stack:

- 3.1 — `services/distributed/plan.ts` (#808)
- 3.2 — `services/distributed/renderChunk.ts` (#809)
- 3.3 — `services/distributed/assemble.ts` (#813)
- **3.4 (this PR)** — `planDir` size cap (`PLAN_TOO_LARGE`)
- 3.5 — distributed format banlist (webm + HDR mp4)
- 3.6 — public exports + `@hyperframes/producer/distributed` subpath

🤖 Generated with [Claude Code](https://claude.com/claude-code)
jrusso1020 added a commit that referenced this pull request May 14, 2026
…4) (#815)

## What

Phase 3 of the distributed rendering plan: §11 PR 3.5 format banlist. Extends `plan()` to refuse two v1-unsupported formats up front with a typed non-retryable `FormatNotSupportedInDistributedError` (`code === "FORMAT_NOT_SUPPORTED_IN_DISTRIBUTED"`).

## Why

Both webm and HDR mp4 are documented as deferred to v1.5 (§7.2 + §12), but until this PR the only signal at the runtime layer is the in-process pipeline silently producing wrong output (chunk concat-copy doesn't round-trip VP9; HDR signaling gets stripped at the chunk boundary). Failing fast at `plan()` time keeps adopters from spending fan-out compute on a render that can't succeed and gives them a typed error code their workflow adapter can route on.

## How

- New exports in `services/distributed/plan.ts`:
  - `FORMAT_NOT_SUPPORTED_IN_DISTRIBUTED` — non-retryable error code matching §11's wording.
  - `FormatNotSupportedInDistributedError` — typed error class with `code`, `format`, and `reason` fields. Message names the rejected format and tells adopters to fall back to the in-process renderer (`executeRenderJob`) which has full format support.
  - `rejectUnsupportedDistributedFormat(config)` — pure helper exported separately so adapters can run the same gate at their input layer (Step Functions input validation, Temporal workflow start) before the activity even runs.
- `plan()` calls `rejectUnsupportedDistributedFormat(config)` as the first line of the function — BEFORE `mkdirSync(planDir)` so a banned input never produces a partial planDir.
- Replaced the previous ad-hoc `if (hdrMode === "force-hdr") throw new Error(...)` with the typed error class.

### What did NOT change

`executeRenderJob`, the in-process orchestrator, the `hyperframes render` CLI, producer HTTP routes — all unchanged. The in-process renderer continues to accept webm + HDR (its existing functionality).

## Test plan

- [x] Unit tests added — `packages/producer/src/services/distributed/planFormatBanlist.test.ts`. 5 cases:
  - `rejectUnsupportedDistributedFormat` accepts the v1-supported formats (mp4, mov, png-sequence) with both `auto` and `force-sdr` hdrMode.
  - Rejects webm — error has `code === FORMAT_NOT_SUPPORTED_IN_DISTRIBUTED`, `format === "webm"`, message mentions in-process renderer.
  - Rejects HDR mp4 (`hdrMode === "force-hdr"`) — error has `format === "mp4-hdr"`, message mentions HDR.
  - End-to-end via `plan()`: webm throws with no planDir leaking to disk.
  - End-to-end via `plan()`: HDR mp4 throws with no planDir leaking to disk.
- [x] `bun test packages/producer/src/services/distributed/` — 30 pass.
- [x] `bun run --filter @hyperframes/producer typecheck` — clean.
- [x] `bunx oxlint` + `bunx oxfmt --check` — clean on changed files.
- [ ] Producer Docker regression harness — pending CI. `executeRenderJob` is unchanged; PSNR baselines should hold.

This is PR 5 of a 6-PR Phase 3 stack:

- 3.1 — `services/distributed/plan.ts` (#808)
- 3.2 — `services/distributed/renderChunk.ts` (#809)
- 3.3 — `services/distributed/assemble.ts` (#813)
- 3.4 — `planDir` size cap (`PLAN_TOO_LARGE`) (#814)
- **3.5 (this PR)** — distributed format banlist (webm + HDR mp4)
- 3.6 — public exports + `@hyperframes/producer/distributed` subpath

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants