feat(runtime): file-lock the TRT-RTX runtime cache #4237
tp5uiuc wants to merge 3 commits into pytorch:main from
Conversation
…y, and native CUDA graph support to C++ runtime
- Introduce IRuntimeConfig scaffolding and bump ABI to v9
- Add runtime cache to C++ runtime for TensorRT-RTX
- Add dynamic shapes kernel specialization strategy to C++ runtime
- Add TensorRT-RTX native CUDA graph strategy to C++ runtime
- Extract TRTRuntimeConfig
- Consolidate C++ runtime tests and add model-level coverage
…xecution_context

release_nccl_comm() previously rebuilt the IExecutionContext via direct calls to ICudaEngine::createExecutionContext, bypassing the TRTRuntimeConfig plumbing introduced earlier in this PR. On that path the RTX runtime cache was not flushed before context teardown, and the dynamic shapes kernel specialization and CUDA graph strategies stored on TRTRuntimeConfig were not re-applied to the new context.

Delegate to recreate_execution_context() instead. It saves the runtime cache, ensures TRTRuntimeConfig is initialized, sets the allocation strategy from resource_allocation_strategy, and creates the new exec context via createExecutionContext(runtime_cfg.config.get()), keeping all strategies live across the NCCL bind/release cycle.
…safety
Adds a cross-platform RAII file-lock primitive (core/util/file_lock.{h,cpp})
matching py-filelock's lock-file convention so the Python and C++ runtimes
sharing a runtime_cache_path do not race the rename and silently drop
compiled kernels.
- Unix backend uses BSD flock(2) -- the primitive py-filelock uses, not
POSIX fcntl record locks (which live in an independent namespace and
would silently fail to interop on Linux).
- Windows backend uses LockFileEx on byte (0,1) -- matches the byte range
msvcrt.locking(..., 1) locks on the Python side.
- Platform branch is hidden behind a LockHandle struct with move-and-swap
semantics, so callers only see a single FileLock RAII type.
- Shared/exclusive modes: load takes shared (multiple readers OK), save
takes exclusive. Python's FileLock is exclusive-only but conflicts
correctly against C++ shared holders since both use the flock namespace.
- 10s acquire timeout via 50ms-cadence poll loop, matching the Python
side's timeout=10. Lock-file path is <cache_path>.lock.
Wired into load_runtime_cache and save_runtime_cache_impl, with the
FileLock scoped to just the I/O block (save writes in-place under the
lock, no tmp+rename). Errors propagate via TORCHTRT_CHECK; the existing
try/catch in ensure_initialized and the noexcept save_runtime_cache
wrapper catch and log, so external behavior on contention is unchanged.
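The locking scheme described above (BSD flock on `<cache_path>.lock`, shared for load / exclusive for save, 50ms poll with 10s timeout, no unlink on release) can be sketched as a rough Python analogue. This is an illustration only, Unix path only; the real primitive is C++ in core/util/file_lock.{h,cpp}, and the class name and signature here are mine, not the PR's:

```python
import errno
import fcntl
import os
import time


class FileLock:
    """Illustrative Python analogue of the described C++ FileLock (Unix only).

    Takes flock(2) on <cache_path>.lock, shared or exclusive, polling every
    50 ms up to a 10 s timeout, and leaves the lock file in place on release.
    """

    def __init__(self, cache_path, shared=False, timeout=10.0, poll=0.05):
        self._path = cache_path + ".lock"   # py-filelock's lock-file convention
        self._mode = fcntl.LOCK_SH if shared else fcntl.LOCK_EX
        self._timeout = timeout
        self._poll = poll
        self._fd = None

    def __enter__(self):
        self._fd = os.open(self._path, os.O_RDWR | os.O_CREAT, 0o644)
        deadline = time.monotonic() + self._timeout
        while True:
            try:
                # Non-blocking attempt; flock(2) has no native timeout.
                fcntl.flock(self._fd, self._mode | fcntl.LOCK_NB)
                return self
            except OSError as e:
                if e.errno not in (errno.EAGAIN, errno.EACCES):
                    os.close(self._fd)
                    self._fd = None
                    raise
                if time.monotonic() >= deadline:
                    os.close(self._fd)
                    self._fd = None
                    raise TimeoutError(f"could not lock {self._path}")
                time.sleep(self._poll)

    def __exit__(self, *exc):
        # Release but do not unlink the lock file, matching py-filelock.
        fcntl.flock(self._fd, fcntl.LOCK_UN)
        os.close(self._fd)
        self._fd = None
        return False
```

Closing the fd releases the flock anyway, so the explicit LOCK_UN is belt-and-braces; the important behavioral points are the poll loop and the no-unlink release.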
Tests:
- tests/cpp/test_file_lock.cpp: 12 unit tests covering exclusive/shared
contention, timeout edges, RAII release, move semantics, no-unlink-on-
release, and a same-namespace flock(2) interop check that verifies the
C++ primitive conflicts with raw flock locks (what py-filelock uses).
- tests/py/dynamo/runtime/test_000_runtime_cache.py:
- parameterizes test_filelock_works and test_sequential_save_load over
both runtimes
- test_python_lock_blocks_cpp_save: an externally-held py-filelock causes
the C++ save to time out silently, leaving the cache file unmodified;
a fresh save after release succeeds
- test_filelock_cross_runtime_parallel: two subprocesses (one Python-
runtime, one C++-runtime) compile against a shared cache_path and both
succeed. Subprocesses rather than threads because torch.export has
thread-unsafe TLS, but cross-process is the real-world locking
scenario anyway.
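The same-namespace interop check mentioned above can be illustrated with raw flock(2) from Python: a second, independently opened descriptor cannot take an exclusive flock non-blocking while one is held, which is exactly the conflict py-filelock (which uses flock on Unix) relies on. The function below is an illustration, not the actual test code:

```python
import fcntl
import os


def flock_conflict_check(lock_path: str) -> bool:
    """Hold an exclusive flock(2) and verify that a second open file
    description on the same path cannot take it non-blocking."""
    fd1 = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    fcntl.flock(fd1, fcntl.LOCK_EX)
    fd2 = os.open(lock_path, os.O_RDWR)
    try:
        # Separate os.open() calls yield separate open file descriptions,
        # so this conflicts even within a single process.
        fcntl.flock(fd2, fcntl.LOCK_EX | fcntl.LOCK_NB)
        conflicted = False
    except BlockingIOError:
        conflicted = True
    finally:
        os.close(fd2)
        fcntl.flock(fd1, fcntl.LOCK_UN)
        os.close(fd1)
    return conflicted
```

A POSIX fcntl record lock taken on the same file would not conflict with this flock, which is why the PR deliberately uses flock rather than fcntl on the C++ side.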
REQUIRES_OUTPUT_ALLOCATOR_IDX,
RESOURCE_ALLOCATION_STRATEGY_IDX,
REQUIRES_NATIVE_MULTIDEVICE_IDX,
#ifdef TRT_MAJOR_RTX
I would rather not ifdef the serialization format, because users might accidentally cross packages here. We can make optional slots with a sentinel value, but TRT-produced programs and TRT-RTX programs should share the format.
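The sentinel-slot idea can be sketched as follows. This is a hypothetical illustration of the suggestion, not the actual format; the 12/15 field counts come from the SERIALIZATION_LEN comment elsewhere in this diff, and the sentinel value and function names are assumptions:

```python
from typing import List, Optional, Tuple

# Hypothetical: one shared serialization length for TRT and TRT-RTX builds,
# with RTX-only slots filled by a sentinel instead of #ifdef-ing the format.
SENTINEL = "<unset>"
COMMON_LEN = 12          # fields both packages share (assumed)
SERIALIZATION_LEN = 15   # total, including the RTX-only slots


def serialize_info(common: List[str], rtx: Optional[List[str]] = None) -> List[str]:
    """Always emit SERIALIZATION_LEN slots; sentinel-fill when RTX fields are absent."""
    slots = list(common) + (list(rtx) if rtx is not None else [SENTINEL] * (SERIALIZATION_LEN - COMMON_LEN))
    assert len(slots) == SERIALIZATION_LEN
    return slots


def deserialize_info(slots: List[str]) -> Tuple[List[str], Optional[List[str]]]:
    """Split back into common fields and optional RTX fields (None if sentinel-filled)."""
    assert len(slots) == SERIALIZATION_LEN
    common, rtx = slots[:COMMON_LEN], slots[COMMON_LEN:]
    return common, (None if all(s == SENTINEL for s in rtx) else rtx)
```

With this shape, a standard-TRT build can still load a program serialized by an RTX build (ignoring or rejecting the RTX slots explicitly) instead of failing on a length mismatch.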
Also are these properties of the engine or are they runtime mode configurations? The point of this interface is the bare minimum information to reconstruct the program from disk
  TRTEngine::TRTEngine(
-     const std::string& serialized_engine,
+     std::string serialized_engine,
why does this need to be a deep copy?
      std::tuple("resource_allocation_strategy", serialized_info[RESOURCE_ALLOCATION_STRATEGY_IDX]),
-     std::tuple("requires_native_multidevice", serialized_info[REQUIRES_NATIVE_MULTIDEVICE_IDX]));
+     std::tuple("requires_native_multidevice", serialized_info[REQUIRES_NATIVE_MULTIDEVICE_IDX])
+ #ifdef TRT_MAJOR_RTX
      this->resource_allocation_strategy == ResourceAllocationStrategy::kDynamic ? "1" : "0";
  serialized_info[REQUIRES_NATIVE_MULTIDEVICE_IDX] = this->requires_native_multidevice ? "1" : "0";
  // rank/world_size are runtime facts (may differ at load time); not serialized.
  #ifdef TRT_MAJOR_RTX
      torch.ops.tensorrt.SERIALIZATION_LEN()
  )  # 15 (RTX) / 12 (standard)
  _DYNAMIC_SHAPES_KERNEL_STRATEGY_MAP: Dict[str, int] = {
Are these things that need user APIs?
  _CUDA_GRAPH_STRATEGY_MAP: Dict[str, int] = {
      "disabled": 0,
      "whole_graph_capture": 1,
  }
  autocast_calibration_dataloader (Optional[torch.utils.data.DataLoader]): The dataloader to use for autocast calibration. Default is None.
  offload_module_to_cpu (bool): Offload the model to CPU to reduce memory footprint during compilation
  dynamically_allocate_resources (bool): Dynamically allocate resources for TensorRT engines
  cuda_graph_strategy (str): TensorRT-RTX CUDA graph strategy: "disabled" (default) or "whole_graph_capture" (let TensorRT-RTX manage CUDA graph capture/replay internally). When set and combined with `torch_tensorrt.runtime.set_cudagraphs_mode(True)` on RTX, overrides manual capture. Not used for standard TensorRT.
Runtime mode controls should be controlled via context managers rather than passed in at compile time. Only information that is fixed at runtime needs to be here
  @@ -0,0 +1,187 @@
  #include <atomic>
move into //tests/core/runtime or //tests/core/util
Summary
Adds a cross-platform RAII file-lock primitive (`core/util/file_lock.{h,cpp}`) wired into `load_runtime_cache` (shared) and `save_runtime_cache_impl` (exclusive) so the Python and C++ TRT-RTX runtimes sharing a `runtime_cache_path` do not race the rename and silently drop compiled kernels. Follow-up to a reviewer ask on #4202 to land locking as a separate pass.
Backend
Matches the `filelock` Python library so the two interop on the same `<cache>.lock` file:
- Unix: `flock(2)` — not POSIX `fcntl(F_SETLK)`, which lives in an independent namespace and would silently fail to interop on Linux.
- Windows: `LockFileEx` on byte range `(0, 1)` — matches `msvcrt.locking(..., 1)` on the Python side.
- `flock(2)` has no native timeout, so `try_lock_for` is a 50ms-cadence poll loop with a 10s default matching `filelock`'s `acquire(timeout=10)`.
- Errors propagate via `TORCHTRT_CHECK`; the existing `try/catch` in `ensure_initialized` and the `noexcept save_runtime_cache` wrapper preserve behavior on contention.
Tests
- C++ (`tests/cpp/test_file_lock.cpp`): 12 gtest cases — exclusive/shared/mixed contention, timeout edges, RAII release, move semantics, no-unlink-on-release, and a same-namespace `flock(2)` interop check.
- Python (`tests/py/dynamo/runtime/test_000_runtime_cache.py`): parameterizes `test_filelock_works` and `test_sequential_save_load` over both runtimes; adds `test_python_lock_blocks_cpp_save` (Python `filelock` blocks C++ save → timeout, cache unchanged, post-release save succeeds) and `test_filelock_cross_runtime_parallel` (two subprocesses, one per runtime, on a shared `cache_path`).
- Local (A100, RTX): `bazel test //tests/cpp:test_file_lock` → 12/12 pass; full `test_000_runtime_cache.py` → all pass + RTX-gated skips.