Skip to content

feat(producer): add shaderTransitionWorkerPool (hf#732 PR 3/5)#758

Merged
vanceingalls merged 1 commit into
mainfrom
vai/677-3-shader-pool
May 13, 2026
Merged

feat(producer): add shaderTransitionWorkerPool (hf#732 PR 3/5)#758
vanceingalls merged 1 commit into
mainfrom
vai/677-3-shader-pool

Conversation

@vanceingalls
Copy link
Copy Markdown
Collaborator

Summary

PR 3 of 5 in the hf#732 decomposition stack. Adds a worker_threads-based pool that runs the shader-transition blend (one of 15 transition shaders) on a fixed-size worker pool. No production wiring yet — the pool stands alone; PR 4 wires it.

The shader blend is a hot inner loop over every pixel of every transition frame at 16bpc. Moving it off the main event loop removes the JS-event-loop ceiling that capped throughput in earlier hf#732 iterations.

New files

  • packages/producer/src/services/shaderTransitionWorker.ts — worker entry. Imports from @hyperframes/engine/shader-transitions (zero-import TS source).
  • packages/producer/src/services/shaderTransitionWorkerPool.ts — fixed-size pool. Uses transferList so the 16bpc HDR from/to/out buffers move by ownership.
  • packages/producer/src/services/shaderTransitionWorkerPool.test.ts — 6 vitest tests pinning byte-equivalence across all 15 shaders, transferList correctness, pool lifecycle. All pass.

Build wiring

  • packages/cli/tsup.config.ts: third tsup entry emits dist/shaderTransitionWorker.js.
  • packages/producer/build.mjs: fourth esbuild entry for direct producer consumers.
  • packages/engine/package.json: adds ./shader-transitions subpath export.

Stack

Stacked on top of #757 (PR 2: pngDecodeBlit pool). No behavior change in any render.

Test plan

  • 6 pool tests pass
  • Producer + engine typecheck clean
  • oxlint clean

— Vai

jrusso1020
jrusso1020 previously approved these changes May 12, 2026
Copy link
Copy Markdown
Collaborator

@jrusso1020 jrusso1020 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verdict: APPROVE — clean standalone pool, structurally parallel to #757

Per Rule 5: required CI all green (Lint, Build, Test, Typecheck, regression, regression-shards, preview-regression, player-perf). Graphite mergeability_check in_progress (normal for stacked PR).

Audited: shaderTransitionWorkerPool.ts end-to-end (~320 lines).
Trusting: the 6 tests (byte-equivalence across all 15 shaders + transferList + lifecycle), shaderTransitionWorker.ts, the build wiring across cli/tsup, producer/build.mjs, and the engine ./shader-transitions subpath export — verified registration but didn't audit internals.

What I checked carefully

Same shape as PR2's pool — same patterns hold here:

  1. Lifecycle correctness (:282-308) — terminate() rejects queued tasks, then in-flight tasks, then Promise.all worker.terminate with swallowed errors. Idempotent guard at top.

  2. onWorkerError / onWorkerExit (:230-247) — buffers-lost framing in the rejection message tells the caller their transferred state can't be recovered. Correct.

  3. transferList contract documented at :33-37 — the most error-prone aspect (caller-must-use-result.*) is in the API JSDoc.

  4. buildExecArgv matches the PR2 pool's pattern for tsx-loader injection under vitest.

  5. Pool size clamping to [1, cpus().length] (:175-176).

One observation worth knowing (non-blocking)

No defensive <8KB Buffer-pool copy here. PR2's pool defensively copies into a dedicated ArrayBuffer if the input Buffer is a slice over the shared 8KB pool. This pool doesn't — it casts .buffer as ArrayBuffer directly. That's fine for the current use case because shader-blend at 16bpc operates on width × height × 6 byte buffers (≥2.4MB for 854×480) — always over the 8KB pool threshold.

If a future caller ever passes shader-blend a smaller buffer (e.g. for a thumbnail-resolution preview blend, or test fixture), postMessage will throw DataCloneError. Worth either:

  • (a) Inheriting the defensive copy from PR2's pool for symmetry, OR
  • (b) Adding an explicit invariant comment: "This pool requires all input buffers to be Buffer.alloc() over dedicated ArrayBuffers (>= 8KB). Smaller buffers will fail transferList. Acceptable invariant for the 16bpc-rgb48le use case but breaks if shader-blend ever runs on smaller resolutions."

Defensible either way; (b) is the lower-effort version of "no silent surprise."

Code-DRY observation (also non-blocking)

The two pools (this one + PR2's pngDecodeBlitWorkerPool) share ~95% of the implementation — resolveWorkerEntry, buildExecArgv, dispatch loop, lifecycle, error handling, even the trace-mode env switch. Could be extracted into a generic WorkerPool<TReq, TRes> base in a follow-up. Not gating, but the two-pool duplication will compound if a third pool (e.g. for caption text rasterization) gets added later.

Praise

  • Same architectural discipline as PR2 — "standalone" + tested before wired into production. Easy to land + audit independently.
  • The "Why a SEPARATE pool" rationale paid off here too: byte-equivalence pinned across all 15 shaders (per test claim) is strong evidence the offloaded blend is identical to the inline path.
  • Per-task traceEnabled is the right primitive for the perf-validation step that lands in PR4/PR5 — being able to confirm concurrency empirically (vs assume from concurrency-control code review) is exactly what perf changes need.

Review by Rames Jusso (pr-review)

miguel-heygen
miguel-heygen previously approved these changes May 12, 2026
Copy link
Copy Markdown
Collaborator

@miguel-heygen miguel-heygen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick review additive to Rames.

Structurally parallel to #757 — same pool shape, same patterns, same quality. Two notes:

  1. Missing <8KB Buffer pool defense — Rames flagged this already but framed it as optional. I lean toward (b): add an explicit invariant comment rather than inheriting the defensive copy. The shader pool's buffers are always width × height × 6 (minimum ~2.4MB at 854×480), so the defense would be dead code that obscures the real contract. A comment like // INVARIANT: all input buffers are Buffer.alloc() over dedicated ArrayBuffers (>= 8KB). Smaller buffers will fail transferList. at the postMessage call site is sufficient and cheaper to maintain.

  2. Code-DRY opportunity — also flagged by Rames. The two pools share ~95% of their implementation. A generic WorkerPool<TReq, TRes> base would be a good follow-up if a third pool materializes. For now the duplication is acceptable — two copies is annoying but manageable; three would warrant extraction.

Build wiring (tsup entry, esbuild entry, subpath export) mirrors #757 exactly. CI green. Ship it.

— Magi

@vanceingalls vanceingalls force-pushed the vai/677-2-decode-pool branch from 978f510 to 2dac19a Compare May 13, 2026 20:36
@vanceingalls vanceingalls force-pushed the vai/677-3-shader-pool branch from d209d60 to 4279ba2 Compare May 13, 2026 20:36
@vanceingalls vanceingalls force-pushed the vai/677-2-decode-pool branch 2 times, most recently from 703a6f8 to ede5cde Compare May 13, 2026 21:05
@vanceingalls vanceingalls force-pushed the vai/677-3-shader-pool branch from 4279ba2 to 20bd055 Compare May 13, 2026 21:05
@vanceingalls vanceingalls changed the base branch from vai/677-2-decode-pool to graphite-base/758 May 13, 2026 21:52
@vanceingalls vanceingalls force-pushed the vai/677-3-shader-pool branch from 20bd055 to cdeafae Compare May 13, 2026 21:52
@graphite-app graphite-app Bot changed the base branch from graphite-base/758 to main May 13, 2026 21:53
@graphite-app graphite-app Bot dismissed stale reviews from miguel-heygen and jrusso1020 May 13, 2026 21:53

The base branch was changed.

hf#677 follow-up. Adds a `worker_threads`-based pool that runs the
shader-transition blend (one of 15 transition shaders) on a fixed-size
worker pool. No production wiring yet — the pool stands alone and the
hybrid path in PR 4 wires it up.

Why a separate pool: the shader blend is a hot inner loop iterated
over every pixel of every transition frame at 16bpc. Running it on the
main event loop was previously the dominant wall-time bottleneck once
DOM capture parallelism was reached. Moving it to `worker_threads`
unblocks the main thread for CDP dispatch and pipelined decode/blit.

Pieces:

* `shaderTransitionWorker.ts` — the worker entry. Imports `TRANSITIONS`
  + `crossfade` from `@hyperframes/engine/shader-transitions` (the
  subpath added here on the engine package). Like the alpha-blit
  subpath, this is a zero-import TS file so it survives the
  `new Worker(<path>)` loader boundary without dragging in the producer
  module graph.

* `shaderTransitionWorkerPool.ts` — fixed-size pool with `run()` API.
  Uses `transferList` so the 16bpc HDR `from`/`to`/`out` buffers move
  by ownership across the boundary instead of being copied.

* `shaderTransitionWorkerPool.test.ts` — 6 vitest tests pinning byte-
  equivalence across all 15 shaders against an inline reference,
  transferList correctness, and pool lifecycle (concurrent dispatch +
  termination). All pass.

Build wiring:

* `packages/cli/tsup.config.ts`: third tsup entry emits
  `dist/shaderTransitionWorker.js` alongside `dist/cli.js`. Same
  rationale as the pngDecodeBlitWorker entry in PR 2 — the pool's
  `new Worker(<path>)` resolver probes for that file next to its
  loaded module.

* `packages/producer/build.mjs`: fourth esbuild entry emits the worker
  as `dist/services/shaderTransitionWorker.js` for direct producer
  consumers. Adds the `@hyperframes/engine/shader-transitions`
  workspace alias to the existing `workspaceAliasPlugin`.

* `packages/engine/package.json`: adds `./shader-transitions` subpath
  export pointing at `src/utils/shaderTransitions.ts`. The file is
  already import-free so the worker can consume the TS source directly.

No behavior change in any render. PR 3 of 5 in the hf#732 decomposition
stack; stacked on top of PR 2 (pngDecodeBlit pool).

-- Vai

Co-Authored-By: Vai <vai@heygen.com>
@vanceingalls vanceingalls force-pushed the vai/677-3-shader-pool branch from cdeafae to 525c597 Compare May 13, 2026 21:53
Copy link
Copy Markdown
Collaborator

@miguel-heygen miguel-heygen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-reviewed after Graphite rebase.

This is a clean, well-structured worker pool that mirrors pngDecodeBlitWorkerPool from PR 2/5. Reviewed the full +860 line diff against the sibling pool for API consistency.

Worker lifecycle and cleanup -- Correct. Spawn-time failure path terminates partially-spawned workers. terminate() rejects both queued and in-flight tasks before calling Worker.terminate(), preventing leaked promises. The onWorkerExit handler rejects the current task on unexpected exit without auto-respawning (right call -- buffers are lost). Idempotent terminate guard prevents double-teardown.

Buffer transfer semantics -- Correct. The three rgb48le ArrayBuffers move via transferList (zero-copy), the worker wraps them with Buffer.from(ab) (zero-copy view), and the reply transfers all three back. The error path also transfers buffers back so the caller can release them. Unlike the PNG pool, this pool does not need the 8KB shared-pool safety net because the caller always uses Buffer.alloc() for rgb48le buffers which creates dedicated ArrayBuffers.

Shader dispatch correctness -- TRANSITIONS[shader] ?? crossfade fallback matches the inline path exactly. The test suite pins byte-equivalence across all 15 shaders plus the unknown-shader fallback. Concurrent dispatch test with pool-size=4 and 8 tasks confirms no slot leakage or result misrouting.

API consistency with pngDecodeBlitWorkerPool -- Pool shape is structurally identical: same PoolLogger interface, same WorkerSlot/PendingTask internals, same resolveWorkerEntry 4-step resolution, same buildExecArgv tsx injection, same run/terminate public API, same trace instrumentation pattern (HF_SHADER_POOL_TRACE=1). The only differences are domain-specific (different buffer shapes, different env var names, no 8KB pool safety net). This is appropriate.

Build wiring -- tsup entry, esbuild entry, and subpath export all follow the same pattern as the PNG worker from PR 2/5. The @hyperframes/engine/shader-transitions subpath points at a zero-import TS file, matching the ./alpha-blit precedent.

Test coverage -- 6 tests covering byte-equivalence (single shader + all 15), transfer semantics (detach verification), unknown shader fallback, concurrent dispatch, explicit workerEntryPath, and graceful terminate with queued tasks. The terminate test is race-tolerant (accepts both resolve and reject for the queued task). Solid.

Ship it.

Copy link
Copy Markdown
Collaborator

@jrusso1020 jrusso1020 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-approved at 525c5975 — carrying forward my prior APPROVE.

Per Rule 3 (additive, not redundant): same diff size (+860/-1 across 6 files) as my prior review at d209d604. Looks like the Graphite-rebase pattern — same as hf#757 re-review I just posted ~15 min ago. Code surfaces are unchanged.

Stack status

  • hf#756 (worker-count cap, stack root): merged
  • hf#757 (PR 2/5, pngDecodeBlit pool): re-approved at ede5cde3 by me + Magi ~15 min ago. New workerEntryPath factory option + bundled-CLI build wiring landed there.
  • hf#758 (this PR, PR 3/5, shaderTransition pool): same 978f5103-era code, functionally unchanged.

If the same workerEntryPath-style addition is incoming for this pool in a future commit (parallel to hf#757), that'll need its own pass — but at HEAD nothing has shifted since my prior review.

Prior advisory items at HEAD

Same disposition as before — neither addressed, both still non-blocking:

  1. No <8KB Buffer pool defensive copy (vs PR2's pool which has one). Magi and I converged on option (b): add an explicit invariant comment at the postMessage site noting "all input buffers are Buffer.alloc() over dedicated ArrayBuffers (≥ 8KB); smaller buffers will fail transferList." The shader-blend's buffer sizes (width × height × 6 ≥ 2.4MB at 854×480) make this invariant safe today; a one-line comment makes the invariant load-bearing if someone ever runs shader-blend on smaller resolutions (thumbnail preview, test fixture).

  2. Code-DRY across the two poolspngDecodeBlitWorkerPool and this one share ~95% of the implementation (resolveWorkerEntry, buildExecArgv, dispatch loop, lifecycle, error handling, trace-mode switch). Both reviewers flagged this. Two copies is manageable; if a third pool gets added (Magi mentioned caption rasterization as a candidate), a generic WorkerPool<TReq, TRes> base extraction becomes worth doing.

Neither blocks merge; both are follow-up housekeeping.

CI

Comprehensive green on this SHA: Lint ✓, Format ✓, File size check ✓, Build ✓, Test ✓, Typecheck ✓, Test: runtime contract ✓, CLI smoke (required) ✓, Smoke: global install ✓, regression ✓, preview-regression ✓, Preview parity ✓, player-perf ✓, CodeQL ✓. The 10 regression-shards + 2 windows-latest jobs are in_progress; no reds anywhere.

Carry-forward praise

Praise from my prior review still applies to the unchanged surfaces:

  • Architectural discipline: standalone pool, tested before wired into production (PR 4 wires it)
  • 6 tests pin byte-equivalence across all 15 shaders — strongest possible evidence that the offloaded blend is identical to the inline path
  • buildExecArgv tsx-loader-injection pattern matches PR2 (consistent dev/test behavior)
  • transferList contract documented at both API JSDoc and inline
  • Pool size clamping [1, cpus().length]
  • Per-task traceEnabled is the right primitive for the perf-validation step that lands in PR 4/5

Verdict

APPROVE re-affirmed. Ready to merge after hf#757 lands (stack ordering). Ship 🚀

Review by Rames Jusso (pr-review)

@vanceingalls vanceingalls merged commit 30348af into main May 13, 2026
39 checks passed
Copy link
Copy Markdown
Collaborator Author

Merge activity

@vanceingalls vanceingalls deleted the vai/677-3-shader-pool branch May 13, 2026 22:18
@jrusso1020 jrusso1020 mentioned this pull request May 13, 2026
vanceingalls added a commit that referenced this pull request May 14, 2026
…n renders (hf#732 PR 4/5) (#759)

## Summary

PR 4 of 5 in the hf#732 decomposition stack. **This is where the bulk of the shader-transition speedup lives** (`~2×` verified — see Empirical validation below).

Spreads per-frame DOM capture work across N DOM worker sessions and offloads the per-pixel shader-blend onto a `worker_threads` pool (the pool added in #758).

### Gating

The hybrid path is gated by `shouldUseHybridLayeredPath`:

- SDR content only — HDR raw-frame sources are fd-bound to one worker (per-worker `dup(fd)` is out of scope here).
- `workerCount >= 2`.
- Not every frame inside a transition window.

When the gate trips, the hybrid loop spawns `workerCount - 1` extra DOM sessions, allocates per-worker scratch buffers, and partitions the frame range into contiguous slices via `distributeLayeredHybridFrameRanges`. Each worker walks its slice; transitions dispatch through the shader-blend pool (with inline fallback). A frame-reorder buffer fences the encoder.

Pool teardown is guaranteed via try/finally on both the success and error paths.

### Structural change (heads-up to reviewers)

`captureHdrStage.ts` on main was already 921 lines (over the project's 500-line ceiling). Adding the hybrid path on top would push it past 1100 and the local pre-commit hook refuses to stage files past 500. **PR 4 splits `captureHdrStage.ts` into 5 files**:

- `captureHdrStage.ts` (orchestrator + cleanup invariants, 469 lines)
- `captureHdrResources.ts` (HDR video extraction + image decode + dim probing)
- `captureHdrFrameShared.ts` (gating predicates, partitioning, per-scene capture)
- `captureHdrSequentialLoop.ts` (legacy single-session loop)
- `captureHdrHybridLoop.ts` (new multi-worker path)

No behavior change in any pre-existing code path: the sequential loop is byte-equivalent to the previous inline implementation (both consume `captureSceneIntoBuffer` from the shared module, so behavior parity is enforced structurally rather than by comment-keeping).

`renderOrchestrator.ts` is intentionally unchanged — the stage computes its own worker budget via `calculateOptimalWorkers` rather than receiving it through the call signature.

## Stack

Stacked on top of #758 (PR 3: shaderTransition pool).

## Test plan

- [x] 14 new vitest tests in `captureHdrFrameShared.test.ts` pinning the hybrid gating predicate and the contiguous-chunking partitioner — all pass
- [x] Producer typecheck clean
- [x] oxlint clean

### Empirical validation

Mark Witt fixture (Mac, Apple Silicon, hardware GPU, no beginframe):
- Published CLI (pre-stack): 2m 12.2s
- Cascade CLI (this stack): 1m 07.7s
- **Measured speedup: 1.95× on Mac.** (Earlier "2.22×" wording was a projection from per-component micro-benchmarks; the empirical end-to-end number is 1.95× on the validated fixture.)

Linux CI confirmation pending top-of-stack regression run.

— Vai
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants