[CI Testing Only] Capture offloader.exe crash dumps via WER (Debug build) #1199

Draft
alsepkow wants to merge 10 commits into llvm:main from alsepkow:pr-1187-testing-debug-dumps

@alsepkow
Collaborator

[CI testing only — do not merge]

Adds Windows-only crash dump capture for offloader.exe so we can attach full memory dumps to the AMD amdxc64.dll PSO crash bug report.

What this changes

Adds 4 Windows-only steps around Run HLSL Tests in build-and-test-callable.yaml:

  1. Configure WER LocalDumps for offloader.exe to write to C:\CrashDumps (full memory dump, DumpType=2, max 20 retained)
  2. Copy any captured dumps into llvm-project/build/test-results/CrashDumps/ so they sit alongside other test output
  3. Upload them as a per-run artifact crash-dumps-<run_id>-<attempt>-<sku>-<target> (14 day retention)
  4. Clean up the WER registry key and the C:\CrashDumps folder after upload. This step runs with if: always() so the runner is left in its original state regardless of outcome
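
For readers who have not opened the diff, steps 1 to 3 can be sketched roughly as below. This is an illustration under stated assumptions, not the exact contents of build-and-test-callable.yaml: the step names, the `inputs.SKU` and `inputs.TestTarget` input names, the `actions/upload-artifact@v4` version, and the `always()` conditions on the copy and upload steps are all assumed. The registry path and the DumpFolder, DumpType, and DumpCount values are the standard WER LocalDumps per-application settings.

```yaml
# Sketch only: step names and input names are hypothetical.
- name: Configure WER LocalDumps for offloader.exe
  if: runner.os == 'Windows'
  shell: powershell
  run: |
    $key = 'HKLM:\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps\offloader.exe'
    $dir = 'C:\CrashDumps'
    # Scrub stale state from a prior aborted run, then recreate the dump folder.
    if (Test-Path -LiteralPath $dir) { Remove-Item -LiteralPath $dir -Recurse -Force }
    New-Item -Path $dir -ItemType Directory -Force | Out-Null
    # -Force recreates the key (and any missing parents) with no leftover values.
    New-Item -Path $key -Force | Out-Null
    New-ItemProperty -Path $key -Name DumpFolder -Value $dir -PropertyType ExpandString -Force | Out-Null
    New-ItemProperty -Path $key -Name DumpType -Value 2 -PropertyType DWord -Force | Out-Null   # 2 = full dump
    New-ItemProperty -Path $key -Name DumpCount -Value 20 -PropertyType DWord -Force | Out-Null # keep at most 20

# ... the existing 'Run HLSL Tests' step runs here ...

- name: Copy crash dumps next to test results
  if: always() && runner.os == 'Windows'   # always() is an assumption here
  shell: powershell
  run: |
    $dest = 'llvm-project\build\test-results\CrashDumps'
    if (Test-Path 'C:\CrashDumps\*.dmp') {
      New-Item -Path $dest -ItemType Directory -Force | Out-Null
      Copy-Item -Path 'C:\CrashDumps\*.dmp' -Destination $dest
    }

- name: Upload crash dumps
  if: always() && runner.os == 'Windows'   # always() is an assumption here
  uses: actions/upload-artifact@v4
  with:
    # inputs.SKU and inputs.TestTarget are placeholder names for the sku/target values.
    name: crash-dumps-${{ github.run_id }}-${{ github.run_attempt }}-${{ inputs.SKU }}-${{ inputs.TestTarget }}
    path: llvm-project/build/test-results/CrashDumps/
    if-no-files-found: ignore
    retention-days: 14
```

Without `always()` (or an equivalent condition) on the copy and upload steps, a failing test step would skip them, which is the case where a dump is most likely to exist.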

Why

The offloader's existing SEH stack-trace printout (LLVM's PrintStackTrace) only gives module-relative offsets. AMD needs register state, thread context, the faulting-thread call stack resolved with their internal symbols, and the complete module list, all of which require a full memory dump.

This PR is based on pr-1187-testing-debug (Debug build, debug layer ON, parallel) — the configuration that has produced the most reliable crashes.

Branch

alsepkow/offload-test-suite:pr-1187-testing-debug-dumps

What you should see if a crash happens

  • A .dmp file (roughly a few hundred MB) appears as a downloadable artifact on the GitHub Actions run page
  • AMD can load it in WinDbg with their internal symbols and get the full faulting-thread context

Notes

  • The steps write to HKLM in the registry, which assumes the self-hosted runner has admin rights (typical for HLSLPC-AMD01)
  • Cleanup step uses if: always() so even on test failure / cancellation, the registry key is removed
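
A minimal sketch of what such a cleanup step might look like, following the behavior described in the last commit message (try/catch around each removal, then a verification pass that emits `##[warning]`). The step name and the log message wording are assumptions; `shell: powershell` reflects that the runner only has Windows PowerShell installed.

```yaml
# Sketch only: step name and log messages are hypothetical.
- name: Clean up WER LocalDumps
  if: always() && runner.os == 'Windows'
  shell: powershell
  run: |
    $key = 'HKLM:\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps\offloader.exe'
    $dir = 'C:\CrashDumps'
    # Remove each target independently so one failure cannot skip the other.
    foreach ($target in @($key, $dir)) {
      try {
        if (Test-Path -LiteralPath $target) {
          Remove-Item -LiteralPath $target -Recurse -Force -ErrorAction Stop
        }
      } catch {
        Write-Host "Failed to remove ${target}: $_"
      }
    }
    # Verify after the fact and surface any lingering state in the run log.
    foreach ($target in @($key, $dir)) {
      if (Test-Path -LiteralPath $target) {
        Write-Host "##[warning]Cleanup incomplete, still present: $target"
      }
    }
```

The verification pass is kept separate from the removal pass so that a removal that fails without throwing is still reported.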

alsepkow and others added 10 commits May 13, 2026 16:07
Both AMD and NVIDIA DirectX configurations have been stable and have higher pass rates than the existing Tier 1 Intel target. Promote them to Tier 1 so they run on every PR. Qualcomm and the Vulkan IHV configurations remain experimental and continue to require the 'test-all' label.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Match the tier change in docs/CI.md and pr-matrix.yaml so the README status table reflects that these targets now run on every PR.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Per Bob's review feedback, switch from listing AMD/NVIDIA D3D12 combinations via 'include' to a cross-product with 'exclude' for the AMD/NVIDIA Vulkan combinations. As future targets get promoted out of experimental, we can simply remove exclusions rather than adding inclusions.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Apply the same cross-product + exclude pattern to the experimental Exec-Tests-Extra job for consistency. As targets are promoted out of experimental, exclusions can be added here in lockstep with their removal from the Tier 1 Exec-Tests-Windows job.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This reverts commit 1eec3eb.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This change is for the draft AMD-testing PR only and should NOT be merged.
Strips the matrix down to only windows-amd x {check-hlsl-d3d12, check-hlsl-clang-d3d12}
so we can quickly iterate on AMD D3D12 stability investigation without spending
CI on Intel/NVIDIA/MacOS/WARP/Vulkan jobs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Companion to the RelWithDebInfo testing draft PR. This branch runs the
same windows-amd D3D12 jobs but with BuildType=Debug to confirm whether
the previously observed Debug-only failures still reproduce.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds Windows-only steps around 'Run HLSL Tests' to:
  1. Configure HKLM WER LocalDumps for offloader.exe (full memory dumps
     to C:\CrashDumps, max 20 retained)
  2. Copy any captured dumps into llvm-project/build/test-results/
     CrashDumps so they live alongside test output
  3. Upload them as a per-run GitHub artifact (14 day retention)
  4. Clean up the registry key and dump folder after upload, so the
     runner is left in its original state regardless of run outcome

Goal: capture full-memory crash dumps of the AMD amdxc64.dll PSO
compilation crashes for AMD's debug analysis. The offloader's existing
SEH stack-trace printout shows offsets only; full dumps give AMD
register state, thread context, and complete module list.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The self-hosted HLSLPC-AMD01 runner only has Windows PowerShell
(powershell.exe) installed, not PowerShell 7+ (pwsh). All three
WER-related steps were failing immediately with 'pwsh: command not
found', which short-circuited the job before tests ran.

Switch shell: pwsh -> shell: powershell to match the convention used
by the existing dxdiag step in the same workflow.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The previous cleanup step ran with -ErrorAction SilentlyContinue and no
verification, which meant a partial failure could silently leave the
HLSLPC-AMD01 machine with the LocalDumps registry key or the
C:\CrashDumps folder still present after the job.

Improvements:
  * Configure step now also scrubs any stale state at the start (in
    case a prior aborted run left the regkey or folder behind).
  * Cleanup step uses try/catch around each removal so one failure
    cannot skip subsequent cleanup operations.
  * Cleanup step verifies after the fact that both the registry key
    and the dump folder are gone, and emits an ##[warning] if not so
    we can spot lingering state in the run summary.

Goal: this PR must never permanently reconfigure the AMD test machine,
regardless of whether tests pass, fail, crash, or are cancelled.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>