Python runtime rework #4222
Conversation
narendasan left a comment
Not sure this PR is clean yet
daad718 to a328e80 Compare
a328e80 to 00e4ea2 Compare
00e4ea2 to 168bdea Compare
I will revert the commit
2b3b974 to c6b8edf Compare
There are some changes that do not conform to Python style guidelines:
--- /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/runtime/_TorchTensorRTModule.py 2026-05-04 23:59:29.343126+00:00
+++ /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/runtime/_TorchTensorRTModule.py 2026-05-04 23:59:48.643676+00:00
@@ -38,10 +38,11 @@
Optional[SerializedTensorRTEngineFmt],
List[str],
List[str],
]
+
class TorchTensorRTModule(torch.nn.Module): # type: ignore[misc]
"""``nn.Module`` that runs a TensorRT engine inside PyTorch.
When the C++ Torch-TensorRT runtime is available, execution uses
``torch.classes.tensorrt.Engine`` and ``torch.ops.tensorrt.execute_engine``.
@@ -157,10 +158,11 @@
if k == "engine":
object.__setattr__(result, k, v) # shallow: reuse the same C++ Engine
else:
object.__setattr__(result, k, copy.deepcopy(v, memo))
return result
+
def _resolve_target_device(self) -> torch.device:
"""Resolve the engine's target CUDA device from compilation settings."""
if self.settings.device is not None:
return torch.device(f"cuda:{self.settings.device.gpu_id}")
return torch.device(f"cuda:{torch.cuda.current_device()}")
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/partitioning/test_000_resource_partitioning.py 2026-05-04 23:59:29.378025+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/partitioning/test_000_resource_partitioning.py 2026-05-04 23:59:52.160386+00:00
@@ -17,10 +17,11 @@
ResourcePartitioner,
)
# Fixed RSS value to make memory-budget calculations deterministic.
_FIXED_RSS_BYTES = 512 * 1024 * 1024 # 512 MB
+
class TestResourcePartitioning(TestCase):
def test_atomic_subgraph_correction(self):
class net(nn.Module):
def __init__(self):
@@ -110,7 +111,8 @@
# The fusion should be fixed after the step
partitioner._verify_all_fusion_nodes_in_same_subgraph(new_subgraphs)
break
+
if __name__ == "__main__":
run_tests()
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/partitioning/test_001_resource_partitioning.py 2026-05-04 23:59:29.378025+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/partitioning/test_001_resource_partitioning.py 2026-05-04 23:59:52.471763+00:00
@@ -25,10 +25,11 @@
resource_partition,
)
# Fixed RSS value used across all tests to make memory-budget calculations deterministic.
_FIXED_RSS_BYTES = 512 * 1024 * 1024 # 512 MB
+
class TestResourcePartitioning(TestCase):
def test_resource_partitioning(self):
class net(nn.Module):
def __init__(self):
@@ -413,7 +414,8 @@
== 4
), "The graph should have 4 accelerated subgraphs"
torch._dynamo.reset()
+
if __name__ == "__main__":
run_tests()
b04e443 to edd97c1 Compare
15503b0 to 42972f0 Compare
42972f0 to 29673a1 Compare
narendasan left a comment
Cool, I think this is looking good. One thing this shows is that maybe we can keep use_python_runtime but only register the op with torch in Python-only setups? Then we aren't losing user flexibility.
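A rough sketch of that idea, assuming the Python op is only registered when the C++ runtime fails to load (the library path, flag name, and registration helper below are illustrative assumptions, not actual Torch-TensorRT code):

```python
import torch

try:
    # Loading the compiled runtime extension registers torch.ops.tensorrt.execute_engine (C++).
    torch.ops.load_library("libtorchtrt_runtime.so")  # illustrative path
    _CPP_RUNTIME_AVAILABLE = True
except (OSError, RuntimeError):
    _CPP_RUNTIME_AVAILABLE = False

if not _CPP_RUNTIME_AVAILABLE:
    # Only register the Python fallback op in Python-only environments, so
    # use_python_runtime stays meaningful without duplicate registrations.
    register_execute_engine_python_op()  # hypothetical helper
```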
int64_t get_streamable_device_memory_budget();
int64_t get_automatic_device_memory_budget();
std::vector<at::Tensor> infer_outputs(std::vector<std::vector<int64_t>> input_shapes);
void set_pre_allocated_outputs(bool enable);
return TorchTensorRTModule(
    serialized_engine=serialized_interpreter_result.serialized_engine,
    input_binding_names=list(serialized_interpreter_result.input_names),
    output_binding_names=list(serialized_interpreter_result.output_names),
Are we dropping use_python_runtime then?
@@ -320,6 +320,7 @@ def no_op_placeholder_for_execute_engine(
    serialized_metadata: str,
    serialized_target_platform: str,
    serialized_require_output_allocator: str,
Could you also add the md_trt tag too? I might have forgotten it.
ABI_VERSION = "9"

class SerializedInfoIndex(IntEnum):
Feature request for later: move this into some sort of data that both C++ and Python can load at build time.
# original __init__ kwarg may have been False, but a saved engine
# can still pin use_python_runtime=True via the settings blob.
self._use_python_runtime = (
    getattr(self.settings, "use_python_runtime", False)
Unify Python and C++ TensorRT runtimes
PythonTorchTensorRTModule is removed. Both runtimes now live behind the same TorchTensorRTModule, with a single use_python_runtime flag (on CompilationSettings) selecting which path to take at engine setup.
What's changed
One module, two backends. TorchTensorRTModule.setup_engine() constructs either torch.classes.tensorrt.Engine (C++) or TRTEngine (Python) and binds the matching op to self.execute_engine_op. forward() just calls the bound op — no per-iteration branching, no separate module class.
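A minimal sketch of that dispatch, assuming simplified constructor state (only setup_engine, forward, execute_engine_op, use_python_runtime, and the two engine/op names come from this description; everything else is illustrative):

```python
from typing import List

import torch


class TorchTensorRTModuleSketch(torch.nn.Module):
    """Illustrative sketch only; the real class is torch_tensorrt's TorchTensorRTModule."""

    def setup_engine(self) -> None:
        if self.settings.use_python_runtime:
            # Python path: wrap the serialized engine in the pure-Python TRTEngine
            # and bind the Python-registered op.
            self.engine = TRTEngine(self.serialized_engine)  # hypothetical ctor args
            self.execute_engine_op = torch.ops.tensorrt.execute_engine_python
        else:
            # C++ path: hand the serialized engine to the TorchScript custom class
            # and bind the C++-registered op.
            self.engine = torch.classes.tensorrt.Engine(self.engine_info)  # hypothetical ctor args
            self.execute_engine_op = torch.ops.tensorrt.execute_engine

    def forward(self, *inputs: torch.Tensor) -> List[torch.Tensor]:
        # No per-iteration branching: always call whichever op setup_engine bound.
        return self.execute_engine_op(list(inputs), self.engine)
```

Because the branch happens once in setup_engine, forward stays identical for both backends, which is what allows a single module class to replace PythonTorchTensorRTModule.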
Two equivalent ops. tensorrt::execute_engine (C++) and tensorrt::execute_engine_python (Python) are both registered and have the same signature, fake kernel, and semantics. The Python op is registered unconditionally so it's available even when the C++ runtime is loaded.
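For context, the general torch.library pattern for pairing a Python op with a fake kernel looks roughly like this; the namespace, signature, and bodies are simplified stand-ins (in particular, the real ops also take the engine object, which this sketch omits):

```python
from typing import List

import torch


@torch.library.custom_op("demo_tensorrt::execute_engine_python", mutates_args=())
def execute_engine_python(inputs: List[torch.Tensor]) -> List[torch.Tensor]:
    # Stand-in body: the real op would run the TensorRT engine on `inputs`.
    return [t.clone() for t in inputs]


@execute_engine_python.register_fake
def _(inputs: List[torch.Tensor]) -> List[torch.Tensor]:
    # Fake (meta) kernel: same signature, returns correctly shaped empty tensors
    # so torch.compile / export can trace through the op without running it.
    return [torch.empty_like(t) for t in inputs]
```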
Checklist: