
perf(producer): native WebGPU shader-blend via Dawn (opt-in, Mac-Metal) #831

Draft
vanceingalls wants to merge 1 commit into
vai/engine-software-renderer-guard from
vai/perf-dawn-webgpu-compositor

Conversation


@vanceingalls vanceingalls commented May 14, 2026

Stacked on #822 — review/merge that first.

EXPERIMENTAL — spike for Vance to pull and test locally on Mac. Draft, do not stamp.

Summary

Adds a Node-side WebGPU compositor for the shader-transition blend, backed by Google's Dawn implementation via the webgpu npm package. Wired into the existing shaderTransitionWorker as an opt-in path. Default OFF — set HF_DAWN_WEBGPU=1 or pass the new --gpu-shader-blend CLI flag to engage.

This is option B from the reference_5x_shader_perf_alternatives.md memo. Sibling spike for option A (page-side compositing) is being drafted in parallel.

Why

The hf#677 chain (PRs #756-#760) gets us 1.95× on Mac via a CPU shader worker pool, ring-buffered pipelining, and a hybrid layered/parallel capture path. The remaining ceiling is the per-pixel JS blend itself — scalar f64 math in v8. On any host with a real GPU (Mac/Metal, Linux/Vulkan, Windows/D3D), moving the blend to the GPU via Dawn should drop blend wall-time on a 854×480 rgb48le frame from ~150–910 ms (depending on shader complexity) to a few ms.

HeadlessExperimental.beginFrame is structurally Mac-unavailable (crbug.com/40656275); native WebGPU is the next-best Mac-viable lever — see #787 / the reference_beginframe_mac_structural_limit.md memo.

Dawn package choice

webgpu@0.4.0 (dawn-gpu/node-webgpu, maintained by Dawn upstream — Corentin Wallez, Kai Ninomiya, etc.). Picked because:

  • It ships a darwin-universal.dawn.node prebuilt that targets Apple Metal directly (no source build, no rosetta).
  • Linux x64/arm64 Vulkan + Windows x64 D3D prebuilts ship in the same package.
  • Maintained by Dawn upstream — the canonical Node binding.
  • Verified locally that the package loads on Linux x64; adapter init returns null in the sandbox (no Vulkan driver) and the worker transparently falls back to the existing CPU path.

The earlier 2026-05-12 Dawn spike concluded "no GPU available" on the Linux sandbox; this PR is the re-spike with Mac as the primary target.

Flag

--gpu-shader-blend on the CLI, or HF_DAWN_WEBGPU=1 in the environment. There's also HF_DAWN_FORCE_FAIL=1 for testing that the fallback engages cleanly.
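The gating contract above can be sketched as a pure predicate over the environment (a sketch with illustrative names — the actual worker may structure this differently):

```typescript
// Sketch of the opt-in gate, assuming the env contract described above.
// HF_DAWN_FORCE_FAIL=1 wins even when the feature is requested, so the
// fallback path can be exercised deterministically in tests.
type GpuGateDecision =
  | { tryGpu: true }
  | { tryGpu: false; reason: "not-requested" | "forced-fail" };

function gpuBlendGate(env: Record<string, string | undefined>): GpuGateDecision {
  if (env.HF_DAWN_WEBGPU !== "1") return { tryGpu: false, reason: "not-requested" };
  if (env.HF_DAWN_FORCE_FAIL === "1") return { tryGpu: false, reason: "forced-fail" };
  return { tryGpu: true };
}
```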

What changed

  • packages/producer/src/services/shaderTransitionGpu.ts (NEW) — compositor: dynamic-imports webgpu, requests adapter + device, compiles WGSL per supported transition, manages per-size GPU resources, dispatches blends, reads back to rgb48le. Returns structured init failures rather than throwing.
  • packages/producer/src/services/shaderTransitionWorker.ts — augmented: first message under HF_DAWN_WEBGPU=1 triggers a one-time GPU probe; supported shaders run on the GPU thereafter; everything else falls through to the existing TRANSITIONS[shader] ?? crossfade CPU path with zero behavior change. Mid-render GPU failure disables the GPU path and falls back.
  • packages/cli/src/commands/render.ts — new --gpu-shader-blend flag sets the env var before pool spawn.
  • packages/producer/build.mjs + packages/cli/tsup.config.ts — webgpu marked external (70+ MB native binary, must not be bundled).
  • packages/producer/package.json + packages/cli/package.json — webgpu@^0.4.0 as optionalDependencies (native binary; feature is opt-in).
  • packages/producer/src/services/shaderTransitionGpu.test.ts (NEW) — 3 vitest tests: HF_DAWN_FORCE_FAIL short-circuit, init never throws, PSNR-vs-CPU ≥ 50dB when an adapter is available (auto-skipped on no-GPU hosts).
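The "structured init failures rather than throwing" contract can be illustrated with a minimal probe sketch. Everything here is an assumed shape, not the PR's actual code, and the `create()` entry point on the Dawn binding is a guess at the binding's API:

```typescript
// Minimal sketch of a never-throwing GPU probe. The real module also
// compiles WGSL pipelines and manages per-size resources; this only
// shows the structured-failure contract described above.
type GpuInit =
  | { ok: true; device: unknown }
  | { ok: false; reason: "module-load-failed" | "no-adapter" | "forced-fail" };

async function probeGpu(env: Record<string, string | undefined>): Promise<GpuInit> {
  if (env.HF_DAWN_FORCE_FAIL === "1") return { ok: false, reason: "forced-fail" };
  let mod: any;
  try {
    // Optional dependency: may be absent, or its native binary may fail to load.
    const moduleName = "webgpu"; // resolved at runtime from node_modules
    mod = await import(moduleName);
  } catch {
    return { ok: false, reason: "module-load-failed" };
  }
  // Hypothetical entry point — the real binding's API may differ.
  const adapter = await mod.create?.([])?.requestAdapter?.();
  if (!adapter) return { ok: false, reason: "no-adapter" };
  return { ok: true, device: await adapter.requestDevice() };
}
```

The caller always gets a value it can branch on; every failure mode maps to "use the CPU path".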

Wall-time pair (Linux sandbox, no GPU — fallback path only)

Bundled CLI worker smoke (854×480 single crossfade blend, harness at /tmp/validate-gpu-worker.mjs):

| config | wall | notes |
|---|---:|---|
| CPU baseline (`HF_DAWN_WEBGPU` unset) | 67 ms | reference |
| `HF_DAWN_WEBGPU=1` (no GPU, fell back) | 70 ms | +3 ms init/probe, then CPU |
| `HF_DAWN_WEBGPU=1` + `HF_DAWN_FORCE_FAIL=1` | 66 ms | fallback verified |

Both fallback runs produced byte-identical output to the CPU baseline checksum — fallback fidelity confirmed.

Mac numbers (TODO — Vance)

Pull this branch, run a shader-transition fixture (e.g. Mark Witt's 14-transition fixture you used for #759/#760) twice:

```
# Baseline (1.95× post-cascade)
node packages/cli/dist/cli.js render <fixture> -o /tmp/baseline.mp4 --workers 6
# With Dawn
node packages/cli/dist/cli.js render <fixture> -o /tmp/dawn.mp4 --workers 6 --gpu-shader-blend
```

Drop the wall-time pair in a comment + look for the GPU compositor active log line.

Determinism trade

GPU compute uses f32 + u16 storage; CPU canonical uses f64. Bit-exact equality is not realistic.

  • The default-off path keeps CI byte-equality pins intact — existing fixtures hit the CPU path unchanged.
  • The fallback CPU path is bit-exact with the canonical CPU implementation (no drift on no-GPU hosts).
  • Fixtures that exercise the GPU path must use a PSNR pin (≥ 50 dB) — documented in the new test.
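A ≥ 50 dB PSNR pin on 16-bit samples is cheap to compute; a sketch (the PR's test may calculate it differently):

```typescript
// PSNR between two rgb48le sample buffers (u16 per channel).
// 50 dB on a 16-bit scale corresponds to an RMS error of roughly
// 65535 / 10^(50/20) ≈ 207 code values — loose enough to tolerate f32
// GPU math, tight enough to catch a broken shader port.
function psnrU16(a: Uint16Array, b: Uint16Array): number {
  if (a.length !== b.length) throw new Error("length mismatch");
  let sumSq = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sumSq += d * d;
  }
  if (sumSq === 0) return Infinity; // identical buffers
  const mse = sumSq / a.length;
  const peak = 65535;
  return 10 * Math.log10((peak * peak) / mse);
}
```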

Fallback behavior

Anywhere the GPU path can't run, the existing CPU path runs unchanged:

  • webgpu package not installed → CPU
  • webgpu loads but requestAdapter() returns null → CPU
  • WGSL pipeline compile fails → CPU
  • --gpu-shader-blend set but shader not in SHADERS_WGSL → CPU (per-frame)
  • Mid-render GPU failure → log + disable GPU + finish on CPU

Failure reason logged once per worker (not per frame). No crash path.
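The "disable after first mid-render failure, log once" behavior can be sketched as a small wrapper (illustrative names, not the worker's actual code):

```typescript
// One-shot degrading dispatcher: tries the GPU blend until it fails
// once, then logs a single reason and stays on CPU for the rest of the
// worker's lifetime. Frames never fail — they fall back.
function makeBlendDispatcher(
  gpuBlend: (frame: Uint16Array) => Uint16Array,
  cpuBlend: (frame: Uint16Array) => Uint16Array,
  log: (msg: string) => void,
) {
  let gpuDisabled = false;
  return (frame: Uint16Array): Uint16Array => {
    if (!gpuDisabled) {
      try {
        return gpuBlend(frame);
      } catch (err) {
        gpuDisabled = true; // logged once per worker, not per frame
        log(`gpu blend failed, falling back to cpu: ${(err as Error).message}`);
      }
    }
    return cpuBlend(frame);
  };
}
```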

Coverage in this spike

One representative shader (crossfade) is ported to WGSL end-to-end as proof of correctness. Other shaders transparently fall through to CPU even when the flag is on. Porting the remaining shaders is a mechanical add to SHADERS_WGSL in shaderTransitionGpu.ts.
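For reference, the crossfade blend itself is a per-sample linear mix; a CPU sketch over u16 samples (matching rgb48le) might look like this — a hypothetical helper, not the PR's TRANSITIONS implementation:

```typescript
// Scalar crossfade over 16-bit samples: out = a*(1-t) + b*t, rounded.
// This per-pixel loop is what the WGSL port moves onto the GPU,
// essentially as a mix() call in a compute shader.
function crossfadeU16(a: Uint16Array, b: Uint16Array, t: number): Uint16Array {
  const out = new Uint16Array(a.length);
  for (let i = 0; i < a.length; i++) {
    out[i] = Math.round(a[i] * (1 - t) + b[i] * t);
  }
  return out;
}
```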

Follow-ups

  • Port the remaining shaders (flashThroughWhite, chromaticSplit, sdfIris, glitch, lightLeak, crossWarpMorph, whipPan, cinematicZoom, gravitationalLens, rippleWaves, swirlVortex, thermalDistortion, domainWarp, ridgedBurn).
  • After Mac wall-time confirms the lever: decide flip-default-on vs keep opt-in.
  • PSNR-only fixture pins for any CI fixture intended to exercise the GPU path.
  • Persistent device + pipeline cache shared across workers (currently per-worker init).

Test plan

  • Producer typecheck clean
  • CLI typecheck clean
  • oxlint + oxfmt clean
  • shaderTransitionGpu.test.ts — 3 tests pass (force-fail short-circuit + no-throw init + PSNR skip-when-no-GPU)
  • shaderTransitionWorkerPool.test.ts — all 7 existing pool tests still pass (CPU path byte-equivalence preserved)
  • Bundled CLI worker smoke (CPU / GPU-flag-fallback / force-fail) confirms fallback fidelity + clean logging
  • Vance: Mac wall-time pair on a shader-transition fixture

PR drafted by Vai

EXPERIMENTAL spike. Adds a Node-side WebGPU compositor backed by the
`webgpu` npm package (Dawn) and wires it into the existing
shader-transition worker as an opt-in alternative to the CPU blend.
Default OFF — set `HF_DAWN_WEBGPU=1` or pass `--gpu-shader-blend` to the
CLI to engage.

The hf#677 chain (PRs #756-#760) gets us 1.95x on Mac via a CPU shader
worker pool, ring-buffered pipelining, and a hybrid layered/parallel
capture path. The remaining ceiling is the per-pixel JS blend itself —
scalar f64 math in v8. On any host with a real GPU (Mac/Metal,
Linux/Vulkan, Windows/D3D), moving the blend to the GPU via Dawn
should drop blend wall time on a 854x480 rgb48le frame from
~150-910 ms (depending on shader complexity) to a few ms, projecting
3-5x end-to-end on top of the existing cascade — see the
`reference_5x_shader_perf_alternatives.md` memo (option B).

beginFrame is structurally Mac-unavailable
(crbug.com/40656275); native WebGPU is the next-best Mac-viable lever.

The Dawn binding (`webgpu@0.4.0`, dawn-gpu/node-webgpu, maintained by
Dawn upstream — Corentin Wallez, Kai Ninomiya) ships a `darwin-universal`
prebuilt that targets Apple Metal directly, plus Linux x64/arm64
Vulkan and Windows x64 D3D. Verified locally that the package loads on
Linux x64; adapter init returns null in the sandbox (no Vulkan driver)
and the worker transparently falls back to the existing CPU path.
Mac numbers must be measured on Vance's laptop — there's no GPU host
in CI.

- `packages/producer/src/services/shaderTransitionGpu.ts` (NEW)
  Node-side compositor: dynamic-imports `webgpu`, requests an adapter
  + device, compiles a WGSL compute shader per supported transition,
  manages per-size GPU resources, dispatches blends, reads back to
  `rgb48le`. Returns structured init failures rather than throwing —
  the caller can always fall back to CPU. `HF_DAWN_FORCE_FAIL=1`
  short-circuits init for testability of the fallback path.
- `packages/producer/src/services/shaderTransitionWorker.ts`
  Augmented: on the first message, if `HF_DAWN_WEBGPU=1`, dynamic-imports
  the GPU module and probes once. On success, supported shaders run on
  the GPU. Unsupported shaders (`glitch`, `domainWarp`, `swirlVortex`,
  and the rest) and any host without a GPU adapter fall through to the
  existing `TRANSITIONS[shader] ?? crossfade` CPU path with zero
  behavior change. Mid-render GPU failure disables the GPU path for the
  rest of the worker's life and falls back — the frame still completes.
- `packages/cli/src/commands/render.ts`
  New `--gpu-shader-blend` boolean flag (default false). When set, the
  CLI exports `HF_DAWN_WEBGPU=1` into `process.env` before any worker
  pool spawns. Env-var plumbing chosen over threading the flag through
  the orchestrator -> stage -> pool -> worker chain because env vars
  cross the `worker_threads` boundary unchanged.
- `packages/producer/build.mjs` + `packages/cli/tsup.config.ts`
  `webgpu` added to the `external` list in both bundlers. The Dawn
  binding ships a 70+ MB native `.dawn.node` binary per platform —
  it must be loaded from the user's node_modules at runtime, not
  inlined into the bundle.
- `packages/producer/package.json` + `packages/cli/package.json`
  `webgpu@^0.4.0` added to `optionalDependencies` on both. Optional
  because (a) it's a native binary that may fail to install on some
  hosts and (b) the feature is opt-in.
- `packages/producer/src/services/shaderTransitionGpu.test.ts` (NEW)
  3 vitest tests: `HF_DAWN_FORCE_FAIL` short-circuit, init never
  throws, and PSNR-vs-CPU >= 50 dB on hosts where the adapter is
  available (auto-skipped on Linux sandbox with a log line).
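The bundler change above (marking `webgpu` external) might look like the following tsup sketch — field names are from tsup's documented config, but the repo's actual config will carry more options:

```typescript
// packages/cli/tsup.config.ts (sketch) — keep the Dawn native binding
// out of the bundle; it is resolved from node_modules at runtime.
import { defineConfig } from "tsup";

export default defineConfig({
  entry: ["src/index.ts"],
  external: ["webgpu"], // 70+ MB native .dawn.node must not be inlined
});
```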

One representative shader (`crossfade`) is ported to WGSL end-to-end
as proof of correctness. The harness is shader-agnostic; porting more
is a mechanical add to `SHADERS_WGSL` in `shaderTransitionGpu.ts`.
Unsupported shaders transparently fall through to CPU even when the
flag is on.

GPU compute uses f32 + u16 storage; CPU canonical uses f64. Bit-exact
equality is not realistic. The default-off path keeps CI byte-equality
pins intact (the existing fixtures hit the CPU path unchanged). Fixtures
that exercise the GPU path must use a PSNR pin (>= 50 dB documented in
the new test). The fallback CPU path is bit-exact with the
canonical CPU implementation.

Bundled CLI worker smoke (854x480 single crossfade blend, see the
investigation log for the harness):

| config | wall | notes |
|---|---:|---|
| CPU baseline (`HF_DAWN_WEBGPU` unset) | 67 ms | reference |
| `HF_DAWN_WEBGPU=1` (fell back, no GPU) | 70 ms | +3 ms init+probe, then CPU |
| `HF_DAWN_WEBGPU=1` + `HF_DAWN_FORCE_FAIL=1` | 66 ms | fallback verified |

Both fallback runs produced byte-identical output to the CPU baseline
checksum — fallback fidelity confirmed.

**Mac numbers: Vance, please fill in.** Pull the branch, run a
shader-transition fixture both with and without `--gpu-shader-blend`,
and drop the wall-time pair in a comment.

Anywhere the GPU path can't run, the existing CPU path runs unchanged:

- `webgpu` package not installed (optional dep skipped on install) -> CPU
- `webgpu` loads but `requestAdapter()` returns null (no GPU) -> CPU
- WGSL pipeline compile fails -> CPU
- `--gpu-shader-blend` set but shader not in `SHADERS_WGSL` -> CPU
- Mid-render GPU failure -> log + disable GPU + finish on CPU

Failure reason is logged once per worker (not per frame). No crash path.

- Port the remaining shaders to WGSL (mechanical: `flashThroughWhite`,
  `chromaticSplit`, `sdfIris`, `glitch`, `lightLeak`, `crossWarpMorph`,
  `whipPan`, `cinematicZoom`, `gravitationalLens`, `rippleWaves`,
  `swirlVortex`, `thermalDistortion`, `domainWarp`, `ridgedBurn`).
- Once Mac wall-time confirms the lever, decide on flip-default-on or
  keep opt-in.
- Update fixture pins for any fixture that needs to exercise the GPU
  path under CI (PSNR-only).
- A persistent device + pipeline cache shared across workers (currently
  per-worker init).

_PR drafted by Vai_

Co-Authored-By: Vai <vai@heygen.com>
@vanceingalls vanceingalls force-pushed the vai/perf-dawn-webgpu-compositor branch from 58a976b to 450bb01 Compare May 14, 2026 15:57
@vanceingalls vanceingalls changed the base branch from main to vai/engine-software-renderer-guard May 14, 2026 15:57