feat(runtime): file-lock the TRT-RTX runtime cache #4237
tp5uiuc wants to merge 3 commits into pytorch:main from
Conversation
…y, and native CUDA graph support to C++ runtime
- Introduce IRuntimeConfig scaffolding and bump ABI to v9
- Add runtime cache to C++ runtime for TensorRT-RTX
- Add dynamic shapes kernel specialization strategy to C++ runtime
- Add TensorRT-RTX native CUDA graph strategy to C++ runtime
- Extract TRTRuntimeConfig
- Consolidate C++ runtime tests and add model-level coverage
…xecution_context

release_nccl_comm() previously rebuilt the IExecutionContext via direct calls to ICudaEngine::createExecutionContext, bypassing the TRTRuntimeConfig plumbing introduced earlier in this PR. On that path the RTX runtime cache was not flushed before context teardown, and the dynamic shapes kernel specialization and CUDA graph strategies stored on TRTRuntimeConfig were not re-applied to the new context.

Delegate to recreate_execution_context() instead. It saves the runtime cache, ensures TRTRuntimeConfig is initialized, sets the allocation strategy from resource_allocation_strategy, and creates the new exec context via createExecutionContext(runtime_cfg.config.get()), keeping all strategies live across the NCCL bind/release cycle.
…safety
Adds a cross-platform RAII file-lock primitive (core/util/file_lock.{h,cpp})
matching py-filelock's lock-file convention so the Python and C++ runtimes
sharing a runtime_cache_path do not race the rename and silently drop
compiled kernels.
- Unix backend uses BSD flock(2) -- the primitive py-filelock uses, not
POSIX fcntl record locks (which live in an independent namespace and
would silently fail to interop on Linux).
- Windows backend uses LockFileEx on byte (0,1) -- matches the byte range
msvcrt.locking(..., 1) locks on the Python side.
- Platform branch is hidden behind a LockHandle struct with move-and-swap
semantics, so callers only see a single FileLock RAII type.
- Shared/exclusive modes: load takes shared (multiple readers OK), save
takes exclusive. Python's FileLock is exclusive-only but conflicts
correctly against C++ shared holders since both use the flock namespace.
- 10s acquire timeout via 50ms-cadence poll loop, matching the Python
side's timeout=10. Lock-file path is <cache_path>.lock.
Wired into load_runtime_cache and save_runtime_cache_impl, with the
FileLock scoped to just the I/O block (save writes in-place under the
lock, no tmp+rename). Errors propagate via TORCHTRT_CHECK; the existing
try/catch in ensure_initialized and the noexcept save_runtime_cache
wrapper catch and log, so external behavior on contention is unchanged.
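The locking scheme described above (BSD flock on `<cache_path>.lock`, shared for load / exclusive for save, 50ms poll with 10s timeout, no unlink on release) can be sketched as a rough Python analogue. This is an illustration only, Unix path only; the real primitive is C++ in core/util/file_lock.{h,cpp}, and the class name and signature here are mine, not the PR's:

```python
import errno
import fcntl
import os
import time


class FileLock:
    """Illustrative Python analogue of the described C++ FileLock (Unix only).

    Takes flock(2) on <cache_path>.lock, shared or exclusive, polling every
    50 ms up to a 10 s timeout, and leaves the lock file in place on release.
    """

    def __init__(self, cache_path, shared=False, timeout=10.0, poll=0.05):
        self._path = cache_path + ".lock"   # py-filelock's lock-file convention
        self._mode = fcntl.LOCK_SH if shared else fcntl.LOCK_EX
        self._timeout = timeout
        self._poll = poll
        self._fd = None

    def __enter__(self):
        self._fd = os.open(self._path, os.O_RDWR | os.O_CREAT, 0o644)
        deadline = time.monotonic() + self._timeout
        while True:
            try:
                # Non-blocking attempt; flock(2) has no native timeout.
                fcntl.flock(self._fd, self._mode | fcntl.LOCK_NB)
                return self
            except OSError as e:
                if e.errno not in (errno.EAGAIN, errno.EACCES):
                    os.close(self._fd)
                    self._fd = None
                    raise
                if time.monotonic() >= deadline:
                    os.close(self._fd)
                    self._fd = None
                    raise TimeoutError(f"could not lock {self._path}")
                time.sleep(self._poll)

    def __exit__(self, *exc):
        # Release but do not unlink the lock file, matching py-filelock.
        fcntl.flock(self._fd, fcntl.LOCK_UN)
        os.close(self._fd)
        self._fd = None
        return False
```

Closing the fd releases the flock anyway, so the explicit LOCK_UN is belt-and-braces; the important behavioral points are the poll loop and the no-unlink release.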
Tests:
- tests/cpp/test_file_lock.cpp: 12 unit tests covering exclusive/shared
contention, timeout edges, RAII release, move semantics, no-unlink-on-
release, and a same-namespace flock(2) interop check that verifies the
C++ primitive conflicts with raw flock locks (what py-filelock uses).
- tests/py/dynamo/runtime/test_000_runtime_cache.py:
- parameterizes test_filelock_works and test_sequential_save_load over
both runtimes
- test_python_lock_blocks_cpp_save: an externally-held py-filelock causes
the C++ save to time out silently, leaving the cache file unmodified;
a fresh save after release succeeds
- test_filelock_cross_runtime_parallel: two subprocesses (one Python-
runtime, one C++-runtime) compile against a shared cache_path and both
succeed. Subprocesses rather than threads because torch.export has
thread-unsafe TLS, but cross-process is the real-world locking
scenario anyway.
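The same-namespace interop check mentioned above can be illustrated with raw flock(2) from Python: a second, independently opened descriptor cannot take an exclusive flock non-blocking while one is held, which is exactly the conflict py-filelock (which uses flock on Unix) relies on. The function below is an illustration, not the actual test code:

```python
import fcntl
import os


def flock_conflict_check(lock_path: str) -> bool:
    """Hold an exclusive flock(2) and verify that a second open file
    description on the same path cannot take it non-blocking."""
    fd1 = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    fcntl.flock(fd1, fcntl.LOCK_EX)
    fd2 = os.open(lock_path, os.O_RDWR)
    try:
        # Separate os.open() calls yield separate open file descriptions,
        # so this conflicts even within a single process.
        fcntl.flock(fd2, fcntl.LOCK_EX | fcntl.LOCK_NB)
        conflicted = False
    except BlockingIOError:
        conflicted = True
    finally:
        os.close(fd2)
        fcntl.flock(fd1, fcntl.LOCK_UN)
        os.close(fd1)
    return conflicted
```

A POSIX fcntl record lock taken on the same file would not conflict with this flock, which is why the PR deliberately uses flock rather than fcntl on the C++ side.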
REQUIRES_OUTPUT_ALLOCATOR_IDX,
RESOURCE_ALLOCATION_STRATEGY_IDX,
REQUIRES_NATIVE_MULTIDEVICE_IDX,
#ifdef TRT_MAJOR_RTX
I would rather not ifdef the serialization format, because users might accidentally cross packages here. We can make optional slots with a sentinel value, but TRT-produced programs and TRT-RTX programs should share the format.
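The sentinel-slot idea can be sketched as follows. This is a hypothetical illustration of the suggestion, not the actual format; the 12/15 field counts come from the SERIALIZATION_LEN comment elsewhere in this diff, and the sentinel value and function names are assumptions:

```python
from typing import List, Optional, Tuple

# Hypothetical: one shared serialization length for TRT and TRT-RTX builds,
# with RTX-only slots filled by a sentinel instead of #ifdef-ing the format.
SENTINEL = "<unset>"
COMMON_LEN = 12          # fields both packages share (assumed)
SERIALIZATION_LEN = 15   # total, including the RTX-only slots


def serialize_info(common: List[str], rtx: Optional[List[str]] = None) -> List[str]:
    """Always emit SERIALIZATION_LEN slots; sentinel-fill when RTX fields are absent."""
    slots = list(common) + (list(rtx) if rtx is not None else [SENTINEL] * (SERIALIZATION_LEN - COMMON_LEN))
    assert len(slots) == SERIALIZATION_LEN
    return slots


def deserialize_info(slots: List[str]) -> Tuple[List[str], Optional[List[str]]]:
    """Split back into common fields and optional RTX fields (None if sentinel-filled)."""
    assert len(slots) == SERIALIZATION_LEN
    common, rtx = slots[:COMMON_LEN], slots[COMMON_LEN:]
    return common, (None if all(s == SENTINEL for s in rtx) else rtx)
```

With this shape, a standard-TRT build can still load a program serialized by an RTX build (ignoring or rejecting the RTX slots explicitly) instead of failing on a length mismatch.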
Also are these properties of the engine or are they runtime mode configurations? The point of this interface is the bare minimum information to reconstruct the program from disk
  TRTEngine::TRTEngine(
-     const std::string& serialized_engine,
+     std::string serialized_engine,
why does this need to be a deep copy?
      std::tuple("resource_allocation_strategy", serialized_info[RESOURCE_ALLOCATION_STRATEGY_IDX]),
-     std::tuple("requires_native_multidevice", serialized_info[REQUIRES_NATIVE_MULTIDEVICE_IDX]));
+     std::tuple("requires_native_multidevice", serialized_info[REQUIRES_NATIVE_MULTIDEVICE_IDX])
+ #ifdef TRT_MAJOR_RTX
      this->resource_allocation_strategy == ResourceAllocationStrategy::kDynamic ? "1" : "0";
  serialized_info[REQUIRES_NATIVE_MULTIDEVICE_IDX] = this->requires_native_multidevice ? "1" : "0";
  // rank/world_size are runtime facts (may differ at load time); not serialized.
  #ifdef TRT_MAJOR_RTX
      torch.ops.tensorrt.SERIALIZATION_LEN()
  )  # 15 (RTX) / 12 (standard)
  _DYNAMIC_SHAPES_KERNEL_STRATEGY_MAP: Dict[str, int] = {
Are these things that need user APIs?
  _CUDA_GRAPH_STRATEGY_MAP: Dict[str, int] = {
      "disabled": 0,
      "whole_graph_capture": 1,
  }
  autocast_calibration_dataloader (Optional[torch.utils.data.DataLoader]): The dataloader to use for autocast calibration. Default is None.
  offload_module_to_cpu (bool): Offload the model to CPU to reduce memory footprint during compilation
  dynamically_allocate_resources (bool): Dynamically allocate resources for TensorRT engines
  cuda_graph_strategy (str): TensorRT-RTX CUDA graph strategy: "disabled" (default) or "whole_graph_capture" (let TensorRT-RTX manage CUDA graph capture/replay internally). When set and combined with `torch_tensorrt.runtime.set_cudagraphs_mode(True)` on RTX, overrides manual capture. Not used for standard TensorRT.
Runtime mode controls should be controlled via context managers rather than passed in at compile time. Only information that is fixed at runtime needs to be here
  @@ -0,0 +1,187 @@
  #include <atomic>
move into //tests/core/runtime or //tests/core/util
Summary
Adds a cross-platform RAII file-lock primitive (`core/util/file_lock.{h,cpp}`) wired into `load_runtime_cache` (shared) and `save_runtime_cache_impl` (exclusive) so the Python and C++ TRT-RTX runtimes sharing a `runtime_cache_path` do not race the rename and silently drop compiled kernels. Follow-up to a reviewer ask on #4202 to land locking as a separate pass.
Backend
Matches the `filelock` Python library so the two interop on the same `<cache>.lock` file:
- Unix: `flock(2)` — not POSIX `fcntl(F_SETLK)`, which lives in an independent namespace and would silently fail to interop on Linux.
- Windows: `LockFileEx` on byte range `(0, 1)` — matches `msvcrt.locking(..., 1)` on the Python side.
- `flock(2)` has no native timeout, so `try_lock_for` is a 50ms-cadence poll loop with a 10s default matching `filelock`'s `acquire(timeout=10)`.
- Errors propagate via `TORCHTRT_CHECK`; the existing `try/catch` in `ensure_initialized` and the `noexcept save_runtime_cache` wrapper preserve behavior on contention.
Tests
- C++ (`tests/cpp/test_file_lock.cpp`): 12 gtest cases — exclusive/shared/mixed contention, timeout edges, RAII release, move semantics, no-unlink-on-release, and a same-namespace `flock(2)` interop check.
- Python (`tests/py/dynamo/runtime/test_000_runtime_cache.py`): parameterizes `test_filelock_works` and `test_sequential_save_load` over both runtimes; adds `test_python_lock_blocks_cpp_save` (Python `filelock` blocks C++ save → timeout, cache unchanged, post-release save succeeds) and `test_filelock_cross_runtime_parallel` (two subprocesses, one per runtime, on a shared `cache_path`).
- Local (A100, RTX): `bazel test //tests/cpp:test_file_lock` → 12/12 pass; full `test_000_runtime_cache.py` → all pass + RTX-gated skips.