Skip to content

fix(engine): force screenshot capture when GPU is software (broader than HDR-only guard)#822

Open
vanceingalls wants to merge 3 commits into
mainfrom
vai/engine-software-renderer-guard
Open

fix(engine): force screenshot capture when GPU is software (broader than HDR-only guard)#822
vanceingalls wants to merge 3 commits into
mainfrom
vai/engine-software-renderer-guard

Conversation

@vanceingalls
Copy link
Copy Markdown
Collaborator

@vanceingalls vanceingalls commented May 14, 2026

What

In packages/engine/src/services/browserManager.ts:acquireBrowser, resolve browserGpuMode (cheap — resolveBrowserGpuMode caches its probe Promise for the process lifetime) and force captureMode = "screenshot" whenever it resolves to "software", regardless of platform/binary. Strip BeginFrame-only chrome flags (--enable-begin-frame-control, --deterministic-mode, etc.) from the launch args when the guard kicks in. Mirror the same gate in packages/engine/src/services/frameCapture.ts's preMode computation so the chromeArgs are built without BeginFrame flags up front. Thread the already-resolved mode through to acquireBrowser via a new AcquireBrowserOptions.resolvedBrowserGpuMode so the call site does not re-resolve from raw config.

Why

On hosts with no hardware GPU (SwiftShader software rendering), Chrome's HeadlessExperimental.beginFrame stalls the compositor on shader-heavy frames — the CDP call hangs and the render eventually fails with beginFrame timed out at the protocolTimeout (default 5 min). This was reproduced during a recent BeginFrame spike on a CPU-only Linux sandbox.

This is an independent, defense-in-depth guard from the producer-side captureHdrStage.ts cfg.forceScreenshot = true:

  • captureHdrStage.ts:211 already forces screenshot unconditionally for the HDR layered-composite path on main. That covers the HDR path on any host.
  • This PR adds a separate guard at acquireBrowser for the SDR-render-on-software-host case — which captureHdrStage does not touch, and which is the load-bearing failure mode today (SDR renders on a software-renderer Linux host still hit the BeginFrame stall).
  • Belt-and-braces: even when both paths land, having the GPU-mode guard at the launch site protects future code that might add new BeginFrame entry points (e.g. a future BeginFrame-alpha path) without remembering to gate on GPU presence.

Earlier framings of this PR described it as "broadening" a narrow compositeTransfer === "srgb" guard from PR #787. That framing was wrong — no such narrow guard exists on main; the HDR path's screenshot fallback is unconditional. The contribution stands on its own as the SDR-on-software-host fix.

How

  1. acquireBrowser now accepts browserGpuMode in its config Pick and (optionally) a pre-resolved resolvedBrowserGpuMode via AcquireBrowserOptions. When the resolved mode is "software", the existing beginframe branch is skipped.
  2. The launch args go through stripBeginFrameFlags whenever captureMode === "screenshot" — defensive against callers that built args before knowing the GPU resolution. (No-op when args are already clean.)
  3. frameCapture.ts's preMode computation gets the same GPU gate so buildChromeArgs doesn't include BeginFrame flags up-front on software hosts; the resolved mode is then threaded into acquireBrowser to avoid a redundant re-resolve.
  4. A one-time console.warn documents the resolution path for operators only when the guard actually changed the outcome — gated on !forceScreenshot && isSoftwareRenderer && headlessShell && isLinux, with a process-level latch so multi-worker renders don't spam.

resolveBrowserGpuMode caches the auto-probe Promise per process, so the threaded value is just a smell-removal — the underlying call would be free anyway after the first invocation.

Test plan

  • Unit tests added (packages/engine/src/services/browserManager.test.ts, 8 new tests in acquireBrowser — software-renderer guard):
    • software → captureMode forced to screenshot + BeginFrame flags stripped
    • hardware + Linux + headless-shell → captureMode stays beginframe, flags preserved
    • forceScreenshot: true still wins regardless of GPU mode (regression guard for existing behavior)
    • one-time software-GPU console.warn fires when the guard changed the outcome
    • no warning on macOS (already screenshot anyway)
    • no warning when caller set forceScreenshot: true (guard did not change outcome)
    • no warning on Linux + hardware + headless-shell (positive control: the gate works in the right direction)
    • fallback launch retry: probeBeginFrameSupport failure → close + re-launch with stripBeginFrameFlags(chromeArgs) (covers the pre-existing defense-in-depth path)
  • bun run --filter @hyperframes/engine typecheck — passes
  • bun run --filter @hyperframes/engine build — passes
  • bun run --filter @hyperframes/engine test — 596/599 passing (3 ffprobe.test.ts failures are pre-existing on origin/main, unrelated: PNG cICP metadata extraction + ffprobe binary handling)
  • bun run --filter @hyperframes/producer typecheck — passes (consumer of the changed types)

— Vai

vanceingalls and others added 3 commits May 14, 2026 01:51
…han HDR-only guard)

When the host has no hardware GPU (SwiftShader software rendering), Chrome's
`HeadlessExperimental.beginFrame` stalls the compositor on shader-heavy
frames — the CDP call hangs and the render eventually fails with
`beginFrame timed out` at protocolTimeout (default 5 min). This was
reproduced on a CPU-only Linux sandbox during the hf#677 BeginFrame spike
and is likely the same root cause as the earlier Mac-Docker timeout that
the HDR-only guard in `captureHdrStage.ts` partially mitigated.

The existing per-stage guard (`cfg.forceScreenshot = true` for the layered
HDR composite) is too narrow — the issue is GPU-presence, not HDR-vs-SDR.
SDR renders also stall on a software-renderer host.

Fix: in `acquireBrowser`, resolve `browserGpuMode` (cheap — the probe
Promise is cached for the process lifetime) and force
`captureMode = "screenshot"` whenever it resolves to "software", regardless
of platform or binary. Defensively strip BeginFrame-only chrome flags
(`--enable-begin-frame-control` etc.) from the launch args in the same
path — leaving them in screenshot mode produces blank captures because the
compositor waits for beginFrames the engine no longer sends.

Mirror the same gate in `frameCapture.ts`'s preMode computation so the
chromeArgs are built without BeginFrame flags up front (the strip in
`acquireBrowser` is belt-and-braces for any other caller).

A one-time `console.warn` documents the resolution path for operators
when the guard actually changed the outcome (Linux + headless-shell only —
silent on the always-screenshot platforms and on hardware hosts).

Note: on `main` as-is the layered HDR path already forces screenshot
mode unconditionally, so the SDR-software-renderer case is the load-bearing
one this PR fixes. The broader guard also keeps the engine safe once any
future BeginFrame-alpha work lands on top.

Tests cover: software → screenshot + BeginFrame flag stripping, hardware →
beginframe preserved, forceScreenshot=true still wins, one-time
software-GPU warning, no warn on macOS (already screenshot anyway).

— Vai
…iew)

Three fixes from PR #822 self-review:

1. Gate the "Software GPU detected" console.warn so it only fires when the
   guard actually changed the outcome. Previously the warn condition
   (captureMode==="screenshot" && isSoftwareRenderer && headlessShell &&
   isLinux) was satisfied even when the caller had explicitly set
   forceScreenshot=true — misleading operators into thinking the guard
   kicked in when they had already picked screenshot mode. Add
   `!forceScreenshot` to the condition and one-shot the warn with a
   process-level latch to avoid spam across multi-worker renders.

2. Thread the already-resolved browserGpuMode from frameCapture.ts into
   acquireBrowser via a new AcquireBrowserOptions.resolvedBrowserGpuMode.
   The Promise cache made the duplicate resolution cheap, but threading
   the value removes the smell of two parallel resolutions of the same
   thing with no static guarantee they agree under future refactors.

3. Add fallback-launch-retry test: simulate probeBeginFrameSupport
   failure via a CDP mock whose `send` rejects, assert the second launch
   strips beginframe-only flags. Also add positive-control tests for
   "Linux + hardware + headless-shell does NOT warn" and "forceScreenshot
   suppresses the software-GPU warn".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ssion)

PR #822's software-renderer guard auto-flipped `captureMode` to "screenshot"
whenever `resolveBrowserGpuMode` returned "software" — but the WebGL probe
classifies GitHub Actions runners (SwiftShader) as "software" too, even
though `HeadlessExperimental.beginFrame` works on them. As a result the
guard pessimized mainstream CI, exceeding `ffmpegStreamingTimeout` (10 min
default) on the largest fixtures and producing this regression on three
shards of run 25841853063:

  - fast → sub-composition-video (390 frames, 1080×1920, video + sub-comp)
  - styles-e → style-13-prod (12 MB baseline, slow tag)
  - styles-g → overlay-montage-prod (42.36 s duration, 1080×1920)

All three failed with `Streaming encode failed: FFmpeg exited with code 255
... received signal 15`. Wall-clock for overlay-montage-prod jumped from
9 m 13 s (main, beginframe) to 10 m 38 s (PR-822, screenshot) — just past
the streaming timeout. ffmpeg was alive and consuming frames the whole
time; the producer side simply couldn't sustain the rate in screenshot
mode for these fixtures.

Fix: gate the auto-flip behind an opt-in env var
`HYPERFRAMES_FORCE_SCREENSHOT_ON_SOFTWARE_GPU`. When unset (default), the
GPU probe is skipped entirely and the original (pre-PR-822) capture-mode
selection runs — Linux + headless-shell → beginframe, with the existing
`probeBeginFrameSupport` fallback handling actual BeginFrame unavailability
at runtime. Operators on hosts where BeginFrame is technically present but
stalls under shader load (the CPU-only Linux sandbox where hf#677 was
reproduced) set the env var to short-circuit BeginFrame entirely.

Mirrored the same gate in `frameCapture.ts`'s preMode resolution so the
expensive probe Chrome is also skipped on the chromeArgs side when the
guard is off.

Tests: kept all eight software-renderer tests; flipped the block-scoped
default to "guard ON" (set in `beforeEach`) so the pre-existing assertions
exercise the guard path. Added a positive-control test that pins the new
default behavior: `Linux + software + headless-shell + guard OFF →
beginframe` (the regression case).

— Vai
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant