feat(runtime): add TensorRT-RTX runtime cache, dynamic shapes strategy, and native CUDA graph support to C++ runtime#4202
Open
tp5uiuc wants to merge 5 commits into pytorch:main
Conversation
tp5uiuc commented Apr 21, 2026
Force-pushed 37ba9f5 to 2b630e8
tp5uiuc added a commit to tp5uiuc/TensorRT that referenced this pull request on Apr 22, 2026
Address the structural PR feedback by extracting TensorRT-RTX-specific
IRuntimeConfig state into its own type and collapsing the per-feature
appliers that previously scattered `#ifdef TRT_MAJOR_RTX` through
TRTEngine.
What
- New core/runtime/TRTRuntimeConfig.{h,cpp} owns the IRuntimeConfig
shared_ptr plus (on TRT-RTX) the IRuntimeCache, runtime-cache path,
dynamic shapes kernel strategy, CUDA graph strategy, and the
rtx_native_cudagraphs_disabled one-shot flag. All per-feature
appliers live there as public members and are no-ops on non-RTX
builds, keeping the only `#ifdef TRT_MAJOR_RTX` scatter contained
in this new file.
- Strategy fields are now strongly-typed enums
(`DynamicShapesKernelStrategy`, `CudaGraphStrategyOption`) with
matching `to_string`/`to_int` helpers, validated at engine
construction via `to_dynamic_shapes_kernel_strategy` /
`to_cuda_graph_strategy_option` rather than raw int ranges.
- `TRTEngine::recreate_execution_context` is now backend-agnostic:
it calls `runtime_cfg.ensure_initialized`, applies the allocation
strategy, and creates the execution context via
`createExecutionContext(IRuntimeConfig*)`. Both standard TensorRT
and TRT-RTX go through this uniform path; only the three RTX-only
setters (`setRuntimeCache`, `setDynamicShapesKernelSpecializationStrategy`,
`setCudaGraphStrategy`) stay behind an `#ifdef TRT_MAJOR_RTX` guard
inside the struct.
- `~TRTEngine` now wraps cleanup in try/catch and delegates cache
persistence to `TRTRuntimeConfig::save_runtime_cache_nothrow`, so
stack unwinding can no longer propagate a cache-save failure out
of the destructor.
- `save_runtime_cache_nothrow` uses `std::filesystem` + atomic
`tmp+rename` only; file locking is out of scope for this PR and
will be introduced in a follow-up once we pick a portable
mechanism.
- `is_monolithic_capturable` asserts `exec_ctx` is non-null; the
three RTX-only appliers `TORCHTRT_ASSERT` that `config` is live
before dereferencing.
- `disable_rtx_native_cudagraphs` persists the runtime cache before
flipping the strategy so any kernels compiled under the internal
capture survive to the next reload.
- `TRTEngine::to_str` now emits human-readable strategy names (via
`to_string(enum)`) instead of integer codes.
- New serialization indices (`RUNTIME_CACHE_PATH_IDX`,
`DYNAMIC_SHAPES_KERNEL_STRATEGY_IDX`, `CUDA_GRAPH_STRATEGY_IDX`) are now
`#ifdef TRT_MAJOR_RTX`-gated in runtime.h, register_jit_hooks.cpp,
the FlattenedState tuple, the serialize/deserialize constructors,
and `__obj_flatten__`. Standard TRT builds keep
`SERIALIZATION_LEN == 11` so engines serialized there do not carry
RTX-only slots.
- Python `_TorchTensorRTModule` reads the RTX-only index accessors
and writes the RTX-only engine-info slots only when
`ENABLED_FEATURES.tensorrt_rtx` is true. Standard TRT users see
no new behavior at runtime.
- Deduplicated `_compiler.py` arguments after rebase on upstream
main where PR pytorch#4184 had already added
`dynamic_shapes_kernel_specialization_strategy`. Kept one copy of
each arg; `cuda_graph_strategy` is threaded through all three
compile() entry points.
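The `tmp + rename` persistence pattern named above can be sketched as follows. This is a Python illustration of the idea only (the PR's implementation is C++ using `std::filesystem`); `save_cache_atomic` is a hypothetical name, not a symbol from this PR:

```python
import os
import tempfile

def save_cache_atomic(path: str, blob: bytes) -> None:
    """Write blob to path atomically: write a temp file in the same
    directory, then rename it over the destination, so a reader never
    observes a partially written cache."""
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(blob)
        os.replace(tmp_path, path)  # atomic within a single filesystem
    except OSError:
        # Best-effort cleanup of the temp file; callers decide whether
        # to propagate the failure.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Because the rename is atomic on the same filesystem, single-writer persistence needs no file locking, which is why locking could be deferred to a follow-up.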
Build + tests
- RTX build on A100 / L40S: libtorchtrt.so and libtorchtrt_runtime.so
link clean, no `#ifdef` diagnostics. Pre-commit checks
pass (clang-format, black, isort, ruff, mypy, typos, buildifier).
- All 35 runtime-cache/strategy tests pass; regression across
test_000_runtime_cache.py (Python runtime), test_002_cudagraphs_cpp.py,
and test_005_dynamic_allocation.py is green.
Addresses review comments on PR pytorch#4202:
- Guarding of new IDX entries and Python accessors on
TRT_MAJOR_RTX / ENABLED_FEATURES.tensorrt_rtx.
- Encapsulation of RTX-specific state in a dedicated type with
enumerated strategies and transparent standard-TRT/RTX behavior.
- Destructor exception safety.
- Unification of the execution-context creation path via
IRuntimeConfig.
- Removal of file locking for runtime-cache persistence.
- Debug asserts before dereferencing the live IRuntimeConfig.
- Human-readable to_str output.
- save_runtime_cache invoked from disable_rtx_native_cudagraphs.
tp5uiuc added a commit to tp5uiuc/TensorRT that referenced this pull request on Apr 22, 2026
Address PR review comments that asked for the new C++ runtime tests to
be folded into existing feature-level files rather than shipped as
parallel `*_cpp.py` files.
What
- Merge `test_000_runtime_cache_cpp.py` into the existing
  `test_000_runtime_cache.py`. The file already covered the Python
  runtime path; two new classes (`TestRuntimeCacheCppPersistence`,
  `TestCppSerializationIndices`) cover the C++ runtime path via
  `use_python_runtime=False`, plus the serialization-index assertions.
  Skipped on non-RTX builds.
- Fold the C++ runtime cases for the dynamic shapes kernel
  specialization strategy into
  `test_001_dynamic_shapes_kernel_strategy.py` (introduced upstream in
  PR pytorch#4184). Two new classes
  (`TestDynamicShapesKernelStrategyCpp`,
  `TestDynamicShapesKernelStrategyCppInvalidValue`) exercise
  lazy/eager/none end-to-end and reject invalid strategy names. The
  pre-existing Python runtime tests remain untouched.
- Rename `test_000_cuda_graph_strategy.py` to
  `test_001_cuda_graph_strategy.py` to match the `test_001_*`
  convention used for L1 RTX-only features. When upstream lands the
  Python runtime counterpart (PR pytorch#4187), both sets fold into the
  same file.
- Add model-level tests: `test_runtime_cache_models.py` gains a
  `TestRuntimeCacheCppModels` class exercising ResNet18 through the C++
  runtime with a warm-cache roundtrip.
  `test_dynamic_shapes_kernel_strategy_models.py` gains
  `TestDynamicShapesKernelStrategyCppModels` covering lazy/eager/none
  on ResNet18 via the C++ runtime.
Verified
- 35 passed / 3 skipped in the runtime/ tests (merged file plus the
  test_001 strategy files).
- No regression in test_002_cudagraphs_cpp.py (8 passed) or
  test_005_dynamic_allocation.py (1 passed).
Addresses PR pytorch#4202 review comments asking for test file merges
and the addition of model-level runtime_cache_models.py /
dynamic_shapes_kernel_strategy_models.py coverage.
tp5uiuc added a commit to tp5uiuc/TensorRT that referenced this pull request on Apr 22, 2026
Follow-up to 54f9ccd / 1fa8c82 addressing the second batch of PR
pytorch#4202 review feedback. Pure refactor with no user-visible
behavior change; all tests green on A100 (35 passed / 3 skipped + 9
regression passed).
TRTEngine
- Constructor signature simplified: three separate `runtime_cache_path`
  / `dynamic_shapes_kernel_strategy` / `cuda_graph_strategy` parameters
  collapsed into a single `TRTRuntimeConfig runtime_cfg` sink
  parameter. The forwarding ctor std::moves it into the primary ctor,
  which std::moves it into the member.
- String sink parameters (mod_name, serialized_engine,
  serialized_metadata) taken by value and moved into members / slugify.
- The deserialization constructor routes through the new free function
  make_runtime_config_from_serialized, which internalizes the
  TRT_MAJOR_RTX-gated index reads so the constructor itself stays
  unguarded.
- FlattenedState uses a single TRTRTX_FLATTENED_STATE_EXTRAS macro for
  the three RTX-only tuple entries instead of duplicating the first
  eleven entries across two branches.
- Destructor restored to the pre-refactor structure:
  torch::cuda::synchronize runs outside a try block and
  runtime_cfg.save_runtime_cache (now noexcept by signature) is called
  directly. Exception safety is guaranteed by the member's type, not by
  a defensive try/catch.
- __obj_flatten__ and serialize cast enum values via
  std::underlying_type_t<...> instead of int so serialization stays in
  lockstep with any future underlying-type change on the enums.
TRTRuntimeConfig
- Conversion helpers take std::underlying_type_t<Enum> (the declared
  32-bit integer type) instead of raw int. Callers at serialization
  boundaries explicitly std::stoi / static_cast into the right type.
- [[nodiscard]] added to to_string,
  to_dynamic_shapes_kernel_strategy, to_cuda_graph_strategy_option,
  uses_internal_capture, is_monolithic_capturable, to_str, and
  make_runtime_config_from_serialized.
- to_string default cases now TORCHTRT_CHECK(false, ...) with the
  unexpected integer value; std::unreachable is C++23.
- set_execution_context_allocation_strategy is now const.
- Cache I/O split into two layers:
  - Free functions load_runtime_cache(path, cache) and
    save_runtime_cache(path, cache) perform the raw std::filesystem I/O
    and use TORCHTRT_CHECK on failure -- exception-propagating, easier
    to test in isolation.
  - Member TRTRuntimeConfig::save_runtime_cache() is a noexcept wrapper
    that calls the free function and swallows exceptions via try/catch
    -- safe from a destructor. The _nothrow suffix is dropped from the
    member name (the signature now carries that contract).
- write_to_str(ostream&) replaced by two functions: a const-correct
  to_str() -> std::string, and a free operator<<(ostream&, const
  TRTRuntimeConfig&) that wraps it with "Runtime cfg { ... }"
  delimiters. TRTEngine::to_str streams the config via the free
  operator.
Python
- _settings.py: removed a duplicated
  dynamic_shapes_kernel_specialization_strategy field and its
  duplicated docstring left over from the upstream rebase of PR
  pytorch#4184 into our changes.
Covers review comments 3126538200, 3126541782, 3126547529, 3126549147,
3126682329, 3126683329, 3126693226, 3126715369, 3126725953, 3126736626,
3126738422, 3126745230, 3126747553, 3126749405, 3126764831, 3126772536,
3126786564, 3126803652, 3126816780, 3126818065, 3126818561, 3126819429,
3126823781, 3126840987, 3126846827.
tp5uiuc added a commit to tp5uiuc/TensorRT that referenced this pull request on Apr 23, 2026
…tion
Follow-up to a4989c7 addressing the second batch of comments on PR
pytorch#4202, plus verification that the non-RTX (standard TensorRT)
build path still compiles and tests correctly skip RTX-only suites.
Reviewer feedback
- FlattenedState: the TRTRTX_FLATTENED_STATE_EXTRAS macro is inlined
  directly into the tuple parameter pack with a nested
  `#ifdef TRT_MAJOR_RTX`; no preprocessor macro is introduced, per the
  reviewer's "Inline and fix" note.
- TRTEngine::to_str now calls `runtime_cfg.to_str()` directly rather
  than relying on the free `operator<<` framing; this keeps the
  engine's existing two-space indentation consistent.
- TRTRuntimeConfig free-function I/O helpers (`load_runtime_cache`,
  `save_runtime_cache`) moved to an anonymous namespace inside
  TRTRuntimeConfig.cpp and removed from the public header; the member
  wrapper `TRTRuntimeConfig::save_runtime_cache()` stays in the header
  (noexcept, catches exceptions from the raw helper). Renamed the
  internal free save helper to `save_runtime_cache_impl` to avoid
  clashing with the member of the same name.
- Enum conversion helpers `to_string(...)` /
  `to_dynamic_shapes_kernel_strategy` / `to_cuda_graph_strategy_option`
  moved to an anonymous namespace in the cpp; nothing outside this
  translation unit needs them now that TRTEngine holds a
  TRTRuntimeConfig directly.
- Replaced the `(void)param;` suppression pattern with
  `TORCHTRT_UNUSED` on the parameter declaration in five places.
- Removed the nested
  `defined(ENABLE_FEATURE_DISABLE_RUNTIME_ALLOCATION)` guard on
  `isStreamCapturable`. Instead, the Bazel rule for
  `//core/runtime:runtime` now sets
  `ENABLE_FEATURE_DISABLE_RUNTIME_ALLOCATION` as a local_define for the
  `:rtx_win` and `:rtx_x86_64` configs so the RTX header's feature gate
  is always on when we're building for RTX, matching the reviewer's
  invariant.
Cross-backend
- Python `_TorchTensorRTModule._pack_engine_info` now always validates
  `dynamic_shapes_kernel_specialization_strategy` and
  `cuda_graph_strategy` against the allowed name lists, regardless of
  whether the build is RTX or standard TRT. The engine-info
  serialization slots are only written on RTX, but the validation runs
  universally so typos surface early on any backend.
Build + test
- RTX A100: 35 passed / 3 skipped on new + merged suites; 9 passed
  regression (test_002_cudagraphs_cpp.py +
  test_005_dynamic_allocation.py). Wheel
  `torch_tensorrt_rtx-2.12.0.dev0+a4989c760`.
- Standard TRT A100: wheel `torch_tensorrt-2.12.0.dev0+a4989c760`
  builds clean without `--use-rtx`. Import smoke shows
  `tensorrt_rtx=False`, `SERIALIZATION_LEN=11`. 7 passed / 31 skipped
  (all skips with clean "Runtime cache is only available with
  TensorRT-RTX" / "CUDA graph strategy is a TensorRT-RTX feature"
  messages); 9 regression passed.
Covers review comments 3126975981, 3127004055, 3127028393, 3127038410,
3127076231, and 3127100282.
tp5uiuc added a commit to tp5uiuc/TensorRT that referenced this pull request on Apr 23, 2026
Follow-up to 612556b addressing the latest batch of comments on
pytorch/TensorRT PR pytorch#4202. Two categories of changes:
Reviewer-suggested C++ simplifications (TRTRuntimeConfig.cpp)
- load_runtime_cache: inlined the deserialize() call directly into
  TORCHTRT_CHECK instead of going through an intermediate bool.
- ensure_initialized / setRuntimeCache: flipped the if/else so the
  success branch comes first and the warning + reset lands in the else,
  matching the reviewer's diff suggestion.
- ensure_initialized / setCudaGraphStrategy: inlined the call into the
  if-condition and dropped the intermediate `bool ok` local.
- disable_rtx_native_cudagraphs: the same shape fix for the
  disable-path setCudaGraphStrategy call.
Runtime cache durability (TRTEngine.cpp)
- recreate_execution_context now flushes the runtime cache before
  rebuilding the IExecutionContext. The destructor already saves at
  teardown, but recreate can happen mid-lifetime around profiling
  toggles and allocator changes; without flushing there, a process kill
  between an allocator flip and teardown would lose any kernels
  compiled during the previous context. No-op on standard TensorRT and
  when no cache path is configured.
Test deduplication
(tests/py/dynamo/**/test_*{runtime_cache,dynamic_shapes_kernel_strategy}*.py)
The reviewer asked to stop copy-pasting bodies between the Python- and
C++-runtime test classes. The persistence, model, and dynamic-shape
suites now share one parameterized body that runs on both runtimes:
- test_000_runtime_cache.py: TestRuntimeCachePersistence holds the
  single body; parameterized.expand(_RUNTIMES) fans out over
  ("python", True) and ("cpp", False). The CppPersistence class, its
  helpers, and CppSimpleModel are gone; a shared ConvModel with seeded
  init drives both paths. The C++ parameter skips itself via
  self.skipTest when torch_tensorrt_runtime is off.
- test_001_dynamic_shapes_kernel_strategy.py: the lazy/eager/none test
  trio in TestDynamicShapesKernelStrategyCpp collapses into a single
  parameterized test_strategy_inference. Same parameter sweep on
  TestDynamicShapesKernelStrategySetup.test_strategy_applied.
- test_runtime_cache_models.py: TestRuntimeCacheModels,
  TestRuntimeCacheDynamicShapes, and TestRuntimeCachePerformance are
  parameterized over (runtime, use_python_runtime); the Cpp* sibling
  class is removed.
- test_dynamic_shapes_kernel_strategy_models.py: one parameter product
  (strategy × runtime) drives both the resnet18 and dynamic-batch
  tests; the Cpp* sibling class is removed.
Net: ~200 fewer lines of test code, same coverage, plus symmetry
between Python- and C++-runtime test execution.
Build + verification
- RTX A100 (ipp1-2162, cuda13.0 dev container), wheel
  torch_tensorrt_rtx-2.12.0.dev0+612556ba0.
- runtime/test_000_runtime_cache.py +
  runtime/test_001_dynamic_shapes_kernel_strategy.py +
  runtime/test_001_cuda_graph_strategy.py: 36 passed / 3 skipped (up
  from 35 pre-dedup — the param expansion picks up one extra
  per-runtime variant on the strategy-applied test).
- runtime/test_005_dynamic_allocation.py +
  runtime/test_002_cudagraphs_cpp.py: 9 passed (regression clean).
- Model-level subset (resnet18 + dynamic-batch sweep across both
  runtimes and all three strategies): 10 passed.
- A dedicated C++-runtime verification script confirms that
  use_python_runtime=False produces TorchTensorRTModule (not
  PythonTorchTensorRTModule), and that the runtime cache is populated
  and flushed through the C++ path (file size > 0 on engine
  destruction).
Covers review comments 3128480385, 3128493651, 3128747920, 3128754155,
3128759096, and 3128764510.
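The single-body fan-out described above looks roughly like this. The PR uses `parameterized.expand`; this sketch substitutes stdlib `unittest.subTest` so it is self-contained, and the roundtrip body and availability flag are simplified placeholders, not the real test logic:

```python
import unittest

# Stand-ins: the real suite compiles a model per runtime; here the
# runtime list mirrors the PR's _RUNTIMES parameterization.
_RUNTIMES = [("python", True), ("cpp", False)]
CPP_RUNTIME_AVAILABLE = True  # placeholder for the real runtime check

class TestRuntimeCachePersistence(unittest.TestCase):
    def _warm_cache_roundtrip(self, use_python_runtime):
        # Placeholder for: compile, run, save cache, reload, re-run.
        return True

    def test_persistence_both_runtimes(self):
        # One shared body fans out over both runtimes; the C++ variant
        # skips itself when the C++ runtime is unavailable.
        for name, use_python_runtime in _RUNTIMES:
            with self.subTest(runtime=name):
                if name == "cpp" and not CPP_RUNTIME_AVAILABLE:
                    self.skipTest("torch_tensorrt C++ runtime unavailable")
                self.assertTrue(self._warm_cache_roundtrip(use_python_runtime))
```

Either mechanism removes the copy-pasted sibling class while keeping one failure report per runtime variant.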
Contributor (Author)
Gated by #4164 which needs to merge first
…y, and native CUDA graph support to C++ runtime
- Introduce IRuntimeConfig scaffolding and bump ABI to v9
- Add runtime cache to C++ runtime for TensorRT-RTX
- Add dynamic shapes kernel specialization strategy to C++ runtime
- Add TensorRT-RTX native CUDA graph strategy to C++ runtime
- Extract TRTRuntimeConfig
- Consolidate C++ runtime tests and add model-level coverage
Force-pushed e852123 to 2705f49
…xecution_context
release_nccl_comm() previously rebuilt the IExecutionContext via direct
calls to ICudaEngine::createExecutionContext, bypassing the
TRTRuntimeConfig plumbing introduced earlier in this PR. On that path
the RTX runtime cache was not flushed before context teardown, and the
dynamic shapes kernel specialization and CUDA graph strategies stored
on TRTRuntimeConfig were not re-applied to the new context.
Delegate to recreate_execution_context() instead. It saves the runtime
cache, ensures TRTRuntimeConfig is initialized, sets the allocation
strategy from resource_allocation_strategy, and creates the new exec
context via createExecutionContext(runtime_cfg.config.get()), keeping
all strategies live across the NCCL bind/release cycle.
cuda_graph_strategy and dynamic_shapes_kernel_specialization_strategy
are TRT-RTX-only at runtime, but they are accepted on every build
through the public compile() / CompilationSettings surface. Their
string-to-enum lookup lived inside the
'if ENABLED_FEATURES.tensorrt_rtx:' block in _pack_engine_info(), so on
a standard (non-RTX) build a typo like
cuda_graph_strategy="wholee_graph_capture" was silently dropped instead
of raising.
Hoist the membership check into TorchTensorRTModule.__init__ so that
invalid strategy names always raise ValueError, regardless of backend.
The RTX-gated index population in _pack_engine_info() keeps reading the
maps unchanged -- only the redundant validation moves.
Fixes the L1 dynamo core tests on standard-TensorRT Windows:
TestCudaGraphStrategyInvalidValue::test_invalid_strategy_raises
TestDynamicShapesKernelStrategyCppInvalidValue::test_invalid_strategy_raises
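The hoisted check amounts to a backend-independent membership test. A minimal sketch, assuming the strategy names mentioned elsewhere in this PR; `validate_strategies` is an illustrative helper, not the actual `_TorchTensorRTModule` API:

```python
# Allowed strategy names as referenced in this PR's tests.
_CUDA_GRAPH_STRATEGIES = {"disabled", "whole_graph_capture"}
_DYNAMIC_SHAPES_STRATEGIES = {"lazy", "eager", "none"}

def validate_strategies(cuda_graph_strategy, dynamic_shapes_strategy):
    """Run the membership check on every backend so typos raise early;
    only an RTX build would go on to serialize the values."""
    if cuda_graph_strategy not in _CUDA_GRAPH_STRATEGIES:
        raise ValueError(
            f"invalid cuda_graph_strategy {cuda_graph_strategy!r}; "
            f"expected one of {sorted(_CUDA_GRAPH_STRATEGIES)}"
        )
    if dynamic_shapes_strategy not in _DYNAMIC_SHAPES_STRATEGIES:
        raise ValueError(
            f"invalid dynamic_shapes_kernel_specialization_strategy "
            f"{dynamic_shapes_strategy!r}; expected one of "
            f"{sorted(_DYNAMIC_SHAPES_STRATEGIES)}"
        )
```

Running this unconditionally in the constructor is what makes a typo like "wholee_graph_capture" fail loudly on non-RTX builds instead of being silently dropped.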
The C++ runtime config introduced in this branch unconditionally referenced
nvinfer1::IRuntimeConfig, which is only available on TensorRT-RTX and on
standard TensorRT >= 10.11. The TensorRT shipped with the Jetpack l4t-r36.4
toolchain (@tensorrt_l4t) predates 10.11 and does not export this type, so
the aarch64-jetpack build fails:
./core/runtime/TRTRuntimeConfig.h:47:29: error: 'IRuntimeConfig' is not
a member of 'nvinfer1'
Inject a TRT_HAS_IRUNTIME_CONFIG macro from core/runtime/BUILD via a
'defines = select({...})' on //core/runtime:runtime. The macro is set on
every build configuration except :jetpack (RTX, SBSA, Windows, default
x86_64 Linux). This is symmetric with how TRT_MAJOR_RTX and
ENABLE_FEATURE_DISABLE_RUNTIME_ALLOCATION are already injected per-config
in the same target.
In the C++ sources, gate every IRuntimeConfig-using site with
'#ifdef TRT_HAS_IRUNTIME_CONFIG':
- the 'config' member of TRTRuntimeConfig
- ensure_initialized's createRuntimeConfig + downstream setter calls
- set_execution_context_allocation_strategy (a no-op on older TRT)
- recreate_execution_context's createExecutionContext(config*) call,
with a fallback to the legacy createExecutionContext() no-arg overload
The pre-existing TRT_MAJOR_RTX-gated runtime_cache / dynamic-shapes /
cuda-graph blocks are a strict subset of TRT_HAS_IRUNTIME_CONFIG, so
behavior on TRT-RTX and on modern standard TensorRT is unchanged.
Note: macro semantics are now 'is the build config named jetpack?' rather
than 'does TRT actually export IRuntimeConfig?'. If @tensorrt_l4t ever
bumps to 10.11+, the BUILD select needs to be updated to flip the gate on
for jetpack.
Force-pushed ad76f73 to 940d99e
tp5uiuc commented May 9, 2026
Comment on lines +13 to +19
// `TRT_HAS_IRUNTIME_CONFIG` is injected by `core/runtime/BUILD` and reflects whether
// the linked TensorRT exports `nvinfer1::IRuntimeConfig` (introduced in TensorRT
// 10.11). It is defined on every build configuration except `:jetpack`, whose
// l4t-r36.4 TensorRT bundle predates 10.11. On the Jetpack path the runtime falls
// back to the legacy `ICudaEngine::createExecutionContext()` no-arg overload and
// treats the rest of this struct's IRuntimeConfig state as inert.
Contributor (Author)
Remove this comment.
tp5uiuc commented May 9, 2026
// to the legacy createExecutionContext() overload, which uses TensorRT's built-in
// default allocation strategy. The set_execution_context_allocation_strategy call
// above is a no-op on this path.
exec_ctx = make_trt(cuda_engine->createExecutionContext());
Contributor (Author)
You might want to set the allocation strategy here too:

this->exec_ctx = make_trt(cuda_engine->createExecutionContext(
    this->resource_allocation_strategy == TRTEngine::ResourceAllocationStrategy::kDynamic
        ? nvinfer1::ExecutionContextAllocationStrategy::kUSER_MANAGED
        : nvinfer1::ExecutionContextAllocationStrategy::kSTATIC));

See if

resource_allocation_strategy == ResourceAllocationStrategy::kDynamic
    ? nvinfer1::ExecutionContextAllocationStrategy::kUSER_MANAGED
    : nvinfer1::ExecutionContextAllocationStrategy::kSTATIC

can be extracted as a common const variable and used on both the
TRT_HAS_IRUNTIME_CONFIG and non-TRT_HAS_IRUNTIME_CONFIG paths.
…xt on Jetpack
Two minor cleanups from review on the IRuntimeConfig gating commit:
1. Drop the BUILD-explainer comment block from TRTRuntimeConfig.h. The
   BUILD file already documents why TRT_HAS_IRUNTIME_CONFIG exists; the
   header doesn't need to repeat it.
2. The previous Jetpack fallback used
   cuda_engine->createExecutionContext() (no-arg), which silently
   dropped the user's resource_allocation_strategy choice. The legacy
   createExecutionContext(ExecutionContextAllocationStrategy) overload
   is available on pre-10.11 TensorRT, so use it -- and extract the
   kDynamic ? kUSER_MANAGED : kSTATIC ternary as a const local so both
   the IRuntimeConfig path (via
   set_execution_context_allocation_strategy) and the legacy path see
   the same value.
Description
Extends three TensorRT-RTX runtime features that landed on the Python
runtime (PythonTorchTensorRTModule) to the C++ runtime path
(TorchTensorRTModule → core/runtime/TRTEngine). All three features
center on nvinfer1::IRuntimeConfig, which the C++ runtime previously
did not use — it called createExecutionContext(...) directly at four
sites.
Features ported in this stack:
Without this PR, users on the C++ runtime path (TorchScript
deployments, use_python_runtime=False) cannot access any of these
TRT-RTX features, and runtime-cache warm-start savings (~8× measured in
#4180) are unavailable on that path.
Commits
Original stack (reviewable independently):
- feat(runtime): introduce IRuntimeConfig scaffolding and bump ABI to
  v9 — shared infra. Adds the IRuntimeConfig/IRuntimeCache members
  (RTX-only), a private recreate_execution_context() helper replacing 4
  direct createExecutionContext call sites, and three new
  SerializedInfoIndex entries (RUNTIME_CACHE_PATH_IDX,
  DYNAMIC_SHAPES_KERNEL_STRATEGY_IDX, CUDA_GRAPH_STRATEGY_IDX) with a
  single ABI bump 8 → 9. Python settings + compile() parameter
  threading. No behavior change — the per-feature apply_* appliers are
  empty stubs filled in by subsequent commits.
- feat(runtime): add runtime cache to C++ runtime for TensorRT-RTX —
  mirror of "feat: add runtime cache API for TensorRT-RTX" #4180. Load
  on engine setup, atomic save on destructor (tmp + rename).
- feat(runtime): add dynamic shapes kernel specialization strategy to
  C++ runtime — mirror of "feat: add dynamic shapes kernel
  specialization strategy for TRT-RTX" #4184. Wires
  IRuntimeConfig::setDynamicShapesKernelSpecializationStrategy.
- feat(runtime): add TensorRT-RTX native CUDA graph strategy to C++
  runtime — mirror of "feat: add TRT-RTX native CUDA graph support"
  #4187. Wires IRuntimeConfig::setCudaGraphStrategy and makes
  execute_engine.cpp TensorRT-RTX-aware: bypasses manual
  at::cuda::CUDAGraph capture on RTX (TRT-RTX handles it internally)
  and uses cudaStreamIsCapturing to detect outer whole-graph capture.
- test: consolidate C++ runtime tests — folds C++ runtime cases into
  the existing test_000_runtime_cache.py and
  test_001_dynamic_shapes_kernel_strategy.py; adds model-level coverage
  in test_runtime_cache_models.py and
  test_dynamic_shapes_kernel_strategy_models.py.
- refactor(runtime): extract TRTRuntimeConfig — moves all
  TensorRT-RTX-specific IRuntimeConfig state into
  core/runtime/TRTRuntimeConfig.{h,cpp}, collapses three separate
  constructor args into a single TRTRuntimeConfig sink, uses enums
  (DynamicShapesKernelStrategy, CudaGraphStrategyOption), makes the
  destructor exception-safe, and contains the #ifdef TRT_MAJOR_RTX
  scatter to the new TU.
- refactor(runtime): third-round review polish + cross-backend
  verification — inlines the FlattenedState macro, moves enum helpers +
  cache I/O into an anonymous namespace in the cpp, replaces
  (void)param; with TORCHTRT_UNUSED, uses std::underlying_type_t
  throughout, adds [[nodiscard]], and adds
  ENABLE_FEATURE_DISABLE_RUNTIME_ALLOCATION as a local_define for the
  RTX Bazel configs so the runtime-allocation feature gate matches the
  RTX header's expectation.
Why bundle three features in one PR
All three features require an IRuntimeConfig on the engine, a single
ABI bump, and extensions to the same serialization/deserialization code
paths. Splitting into three independent PRs would trigger three
consecutive ABI bumps and triple the surface area for backward-compat
fallout. Keeping them in one stack keeps ABI changes atomic while still
giving reviewers clean per-feature diffs.
Type of change
"8"to"9"— old.pt/.epfiles targeting the C++ runtime will failverify_serialization_fmtwith a clear error, as with every prior ABI bump)"lazy","disabled") keep existing behavior; existing docs for the Python-runtime runtime cache already cover the conceptChecklist
- (CompilationSettings; full user-guide updates pending the feature PRs
  on the Python path)
- (test_000_runtime_cache.py,
  test_001_dynamic_shapes_kernel_strategy.py,
  test_001_cuda_graph_strategy.py; model-level coverage in
  test_runtime_cache_models.py,
  test_dynamic_shapes_kernel_strategy_models.py)
Test plan
Cross-backend verification on an A100 (ipp1 node) inside the
main-native-x86_64-ubuntu24.04-cuda13.0 dev container, PyTorch nightly
2.13.0.dev20260420, CUDA 13.0.
TensorRT-RTX build
Wheel:
torch_tensorrt_rtx-2.12.0.dev0+612556ba0 (built with
python3 setup.py bdist_wheel --use-rtx, TensorRT-RTX 1.4.0.76).
Import smoke:
- ENABLED_FEATURES.tensorrt_rtx = True
- ABI_VERSION = 9
- SERIALIZATION_LEN = 14
- RUNTIME_CACHE_PATH_IDX = 11, DYNAMIC_SHAPES_KERNEL_STRATEGY_IDX = 12,
  CUDA_GRAPH_STRATEGY_IDX = 13
Breakdown of the 35 RTX-side passes:
- (TestRuntimeCacheSetup), persistence and warm-cache roundtrip on both
  Python and C++ runtimes (TestRuntimeCachePersistence,
  TestRuntimeCacheCppPersistence), concurrency/filelock
  (TestRuntimeCacheConcurrency), timing-cache skip
  (TestTimingCacheSkipped), and serialization-index registration
  (TestCppSerializationIndices).
- (TestDynamicShapesKernelStrategySetup), full {lazy, eager, none}
  end-to-end through both runtimes
  (TestDynamicShapesKernelStrategyCpp), dynamic-shape traversal, and
  invalid-value rejection
  (TestDynamicShapesKernelStrategyCppInvalidValue).
- (TestCudaGraphStrategySettings), {disabled, whole_graph_capture} via
  the C++ runtime (TestCudaGraphStrategyCpp), RTX-native override when
  set_cudagraphs_mode(True) is combined with a strategy, repeated
  inference, and invalid-value rejection
  (TestCudaGraphStrategyInvalidValue).
Model-level: e2e ResNet18 compilation + inference via the C++ runtime path with each
{lazy, eager, none}strategy and with runtime cache warm-roundtrip (added intests/py/dynamo/models/).Standard TensorRT build
Wheel:
torch_tensorrt-2.12.0.dev0+612556ba0(built with plainpython3 setup.py bdist_wheel, TensorRT 10.16.0).Import smoke:
ENABLED_FEATURES.tensorrt_rtx = FalseABI_VERSION = 9SERIALIZATION_LEN = 11(no RTX-only slots in the FlattenedState)register_jit_hooks.cppThe 7 passes on standard TRT are:
TestNonRTXUnchanged— confirms the existing Python-runtime paths are unaffected (runtime cache / timing cache behavior, noruntime_configmember) — 2 tests.TestDynamicShapesKernelStrategyNonRTX::test_setting_ignored_on_non_rtx— confirms the newdynamic_shapes_kernel_specialization_strategysetting is silently ignored on non-RTX.TestCudaGraphStrategySettings::{test_default_value, test_settable_values}—CompilationSettingsaccepts the new fields on any backend.TestDynamicShapesKernelStrategyCppInvalidValue::test_invalid_strategy_raises+TestCudaGraphStrategyInvalidValue::test_invalid_strategy_raises— unknown strategy names are rejected at_pack_engine_infotime even on non-RTX (validation is cross-platform; the engine-info serialization slots themselves are RTX-only).The 31 skips all carry clean messages (sample):
Regression on standard TRT:
Summary
torch_tensorrt_rtx-2.12.0.dev0+612556ba0 /
torch_tensorrt-2.12.0.dev0+612556ba0. 0 failures on either backend; all
RTX-gated suites skip cleanly with descriptive messages on the standard
TRT build.