fix(engine): force screenshot capture when GPU is software (broader than HDR-only guard)#822
Open
vanceingalls wants to merge 3 commits into
Open
fix(engine): force screenshot capture when GPU is software (broader than HDR-only guard)#822vanceingalls wants to merge 3 commits into
vanceingalls wants to merge 3 commits into
Conversation
…han HDR-only guard) When the host has no hardware GPU (SwiftShader software rendering), Chrome's `HeadlessExperimental.beginFrame` stalls the compositor on shader-heavy frames — the CDP call hangs and the render eventually fails with `beginFrame timed out` at protocolTimeout (default 5 min). This was reproduced on a CPU-only Linux sandbox during the hf#677 BeginFrame spike and is likely the same root cause as the earlier Mac-Docker timeout that the HDR-only guard in `captureHdrStage.ts` partially mitigated. The existing per-stage guard (`cfg.forceScreenshot = true` for the layered HDR composite) is too narrow — the issue is GPU-presence, not HDR-vs-SDR. SDR renders also stall on a software-renderer host. Fix: in `acquireBrowser`, resolve `browserGpuMode` (cheap — the probe Promise is cached for the process lifetime) and force `captureMode = "screenshot"` whenever it resolves to "software", regardless of platform or binary. Defensively strip BeginFrame-only chrome flags (`--enable-begin-frame-control` etc.) from the launch args in the same path — leaving them in screenshot mode produces blank captures because the compositor waits for beginFrames the engine no longer sends. Mirror the same gate in `frameCapture.ts`'s preMode computation so the chromeArgs are built without BeginFrame flags up front (the strip in `acquireBrowser` is belt-and-braces for any other caller). A one-time `console.warn` documents the resolution path for operators when the guard actually changed the outcome (Linux + headless-shell only — silent on the always-screenshot platforms and on hardware hosts). Note: on `main` as-is the layered HDR path already forces screenshot mode unconditionally, so the SDR-software-renderer case is the load-bearing one this PR fixes. The broader guard also keeps the engine safe once any future BeginFrame-alpha work lands on top. Tests cover: software → screenshot + BeginFrame flag stripping, hardware → beginframe preserved, forceScreenshot=true still wins, one-time software-GPU warning, no warn on macOS (already screenshot anyway). — Vai
…iew) Three fixes from PR #822 self-review: 1. Gate the "Software GPU detected" console.warn so it only fires when the guard actually changed the outcome. Previously the warn condition (captureMode==="screenshot" && isSoftwareRenderer && headlessShell && isLinux) was satisfied even when the caller had explicitly set forceScreenshot=true — misleading operators into thinking the guard kicked in when they had already picked screenshot mode. Add `!forceScreenshot` to the condition and one-shot the warn with a process-level latch to avoid spam across multi-worker renders. 2. Thread the already-resolved browserGpuMode from frameCapture.ts into acquireBrowser via a new AcquireBrowserOptions.resolvedBrowserGpuMode. The Promise cache made the duplicate resolution cheap, but threading the value removes the smell of two parallel resolutions of the same thing with no static guarantee they agree under future refactors. 3. Add fallback-launch-retry test: simulate probeBeginFrameSupport failure via a CDP mock whose `send` rejects, assert the second launch strips beginframe-only flags. Also add positive-control tests for "Linux + hardware + headless-shell does NOT warn" and "forceScreenshot suppresses the software-GPU warn". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ssion) PR #822's software-renderer guard auto-flipped `captureMode` to "screenshot" whenever `resolveBrowserGpuMode` returned "software" — but the WebGL probe classifies GitHub Actions runners (SwiftShader) as "software" too, even though `HeadlessExperimental.beginFrame` works on them. As a result the guard pessimized mainstream CI, exceeding `ffmpegStreamingTimeout` (10 min default) on the largest fixtures and producing this regression on three shards of run 25841853063: - fast → sub-composition-video (390 frames, 1080×1920, video + sub-comp) - styles-e → style-13-prod (12 MB baseline, slow tag) - styles-g → overlay-montage-prod (42.36 s duration, 1080×1920) All three failed with `Streaming encode failed: FFmpeg exited with code 255 ... received signal 15`. Wall-clock for overlay-montage-prod jumped from 9 m 13 s (main, beginframe) to 10 m 38 s (PR-822, screenshot) — just past the streaming timeout. ffmpeg was alive and consuming frames the whole time; the producer side simply couldn't sustain the rate in screenshot mode for these fixtures. Fix: gate the auto-flip behind an opt-in env var `HYPERFRAMES_FORCE_SCREENSHOT_ON_SOFTWARE_GPU`. When unset (default), the GPU probe is skipped entirely and the original (pre-PR-822) capture-mode selection runs — Linux + headless-shell → beginframe, with the existing `probeBeginFrameSupport` fallback handling actual BeginFrame unavailability at runtime. Operators on hosts where BeginFrame is technically present but stalls under shader load (the CPU-only Linux sandbox where hf#677 was reproduced) set the env var to short-circuit BeginFrame entirely. Mirrored the same gate in `frameCapture.ts`'s preMode resolution so the expensive probe Chrome is also skipped on the chromeArgs side when the guard is off. Tests: kept all eight software-renderer tests; flipped the block-scoped default to "guard ON" (set in `beforeEach`) so the pre-existing assertions exercise the guard path. Added a positive-control test that pins the new default behavior: `Linux + software + headless-shell + guard OFF → beginframe` (the regression case). — Vai
7 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
In
packages/engine/src/services/browserManager.ts:acquireBrowser, resolvebrowserGpuMode(cheap —resolveBrowserGpuModecaches its probe Promise for the process lifetime) and forcecaptureMode = "screenshot"whenever it resolves to"software", regardless of platform/binary. Strip BeginFrame-only chrome flags (--enable-begin-frame-control,--deterministic-mode, etc.) from the launch args when the guard kicks in. Mirror the same gate inpackages/engine/src/services/frameCapture.ts'spreModecomputation so the chromeArgs are built without BeginFrame flags up front. Thread the already-resolved mode through toacquireBrowservia a newAcquireBrowserOptions.resolvedBrowserGpuModeso the call site does not re-resolve from raw config.Why
On hosts with no hardware GPU (SwiftShader software rendering), Chrome's
HeadlessExperimental.beginFramestalls the compositor on shader-heavy frames — the CDP call hangs and the render eventually fails withbeginFrame timed outat the protocolTimeout (default 5 min). This was reproduced during a recent BeginFrame spike on a CPU-only Linux sandbox.This is an independent, defense-in-depth guard from the producer-side
captureHdrStage.tscfg.forceScreenshot = true:captureHdrStage.ts:211already forces screenshot unconditionally for the HDR layered-composite path onmain. That covers the HDR path on any host.acquireBrowserfor the SDR-render-on-software-host case — whichcaptureHdrStagedoes not touch, and which is the load-bearing failure mode today (SDR renders on a software-renderer Linux host still hit the BeginFrame stall).Earlier framings of this PR described it as "broadening" a narrow
compositeTransfer === "srgb"guard from PR #787. That framing was wrong — no such narrow guard exists onmain; the HDR path's screenshot fallback is unconditional. The contribution stands on its own as the SDR-on-software-host fix.How
acquireBrowsernow acceptsbrowserGpuModein itsconfigPick and (optionally) a pre-resolvedresolvedBrowserGpuModeviaAcquireBrowserOptions. When the resolved mode is"software", the existing beginframe branch is skipped.stripBeginFrameFlagswhenevercaptureMode === "screenshot"— defensive against callers that built args before knowing the GPU resolution. (No-op when args are already clean.)frameCapture.ts'spreModecomputation gets the same GPU gate sobuildChromeArgsdoesn't include BeginFrame flags up-front on software hosts; the resolved mode is then threaded intoacquireBrowserto avoid a redundant re-resolve.console.warndocuments the resolution path for operators only when the guard actually changed the outcome — gated on!forceScreenshot && isSoftwareRenderer && headlessShell && isLinux, with a process-level latch so multi-worker renders don't spam.resolveBrowserGpuModecaches the auto-probe Promise per process, so the threaded value is just a smell-removal — the underlying call would be free anyway after the first invocation.Test plan
packages/engine/src/services/browserManager.test.ts, 8 new tests inacquireBrowser — software-renderer guard):software→ captureMode forced to screenshot + BeginFrame flags strippedhardware+ Linux + headless-shell → captureMode staysbeginframe, flags preservedforceScreenshot: truestill wins regardless of GPU mode (regression guard for existing behavior)console.warnfires when the guard changed the outcomeforceScreenshot: true(guard did not change outcome)probeBeginFrameSupportfailure → close + re-launch withstripBeginFrameFlags(chromeArgs)(covers the pre-existing defense-in-depth path)bun run --filter @hyperframes/engine typecheck— passesbun run --filter @hyperframes/engine build— passesbun run --filter @hyperframes/engine test— 596/599 passing (3ffprobe.test.tsfailures are pre-existing onorigin/main, unrelated: PNG cICP metadata extraction + ffprobe binary handling)bun run --filter @hyperframes/producer typecheck— passes (consumer of the changed types)— Vai