feat: TorchTRT Cuda generated kernels plugin support #4199

Open
bowang007 wants to merge 1 commit into main from tta_cuda_plugin

@bowang007 bowang007 commented Apr 21, 2026

Description

This PR introduces torch_tensorrt.kernels, an experimental module for registering hand-written CUDA C++ kernels as both PyTorch custom ops (for eager execution) and TensorRT Quick Deployable Plugins with AOT support (for torch_tensorrt.compile).

Usage

import torch, torch_tensorrt
import torch_tensorrt.kernels as ttk

# CUDA C++ source for the kernel; the name must match kernel_source below.
CU_SIGMOID = """
extern "C" __global__ void my_sigmoid(const float* x, int n, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = 1.0f / (1.0f + __expf(-x[i]));
}
"""

ttk.cuda_kernel_op(
    "kern_ex::sigmoid",
    ttk.KernelSpec(
        kernel_source=CU_SIGMOID,
        kernel_name="my_sigmoid",
        inputs=[ttk.InputDecl("x")],
        outputs=[ttk.OutputDecl("y", shape=ttk.SameAs("x"))],
        extras=[ttk.Numel("x")],
        geometry=ttk.Elementwise(block=(256,), layout="flat"),
    ),
    supports_dynamic_shapes=True,
)

After this call, torch.ops.kern_ex.sigmoid is available in eager mode and is embedded as a TensorRT plugin during torch_tensorrt.compile. The meta function, eager
launch, AOT implementation, and PyTorch schema are all derived from the KernelSpec.
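
To make the launch geometry concrete, here is a minimal pure-Python sketch of what a flat elementwise launch with block=(256,) amounts to for my_sigmoid. It assumes the grid size is the ceiling of n / block, which is the usual convention but is not confirmed by this description. The names flat_grid and emulate_sigmoid_launch are hypothetical, used only for illustration, and are not part of torch_tensorrt.

```python
import math


def flat_grid(numel: int, block: int = 256) -> int:
    """Blocks needed to cover numel elements (assumption: ceil division)."""
    return (numel + block - 1) // block


def emulate_sigmoid_launch(x: list[float], block: int = 256) -> list[float]:
    """Serially replay the thread indexing of my_sigmoid on the CPU."""
    n = len(x)
    y = [0.0] * n
    for block_idx in range(flat_grid(n, block)):
        for thread_idx in range(block):
            i = block_idx * block + thread_idx  # same index math as the kernel
            if i < n:  # bounds guard for the partially filled last block
                y[i] = 1.0 / (1.0 + math.exp(-x[i]))
    return y


out = emulate_sigmoid_launch([0.0] * 1000)
```

With 1000 elements and a block of 256, four blocks are launched and the last one is only partly used, which is why the kernel needs the `i < n` guard and the Numel("x") extra.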

API Surface

The module exposes two primary entry points, layered by declarativeness:

cuda_kernel_op (named auto_cuda_kernel_plugin in earlier revisions of this PR) is the recommended default. The caller supplies a KernelSpec dataclass describing the kernel's inputs, outputs (with a shape relation such
as SameAs or ReduceDims), scalar extras (Numel, DimSize), and launch geometry (Elementwise or Reduction). The framework derives the meta function, eager CUDA
launch, TensorRT AOT implementation, and PyTorch schema. This path covers pointwise kernels (1-D flat or N-D grid launches), reductions (with optional keepdim),
multi-input kernels, and scalar (non-tensor) kernel arguments via ScalarInput.
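
As a rough illustration of how a meta function can be derived from a shape relation, the sketch below computes output shapes for a SameAs-style and a ReduceDims-style relation in plain Python. The function names and the keepdim semantics (mirroring torch.sum) are assumptions made for illustration; the derivation implemented in this PR may differ.

```python
from typing import Iterable, Sequence, Tuple


def same_as_shape(input_shape: Sequence[int]) -> Tuple[int, ...]:
    """Output has exactly the shape of the referenced input."""
    return tuple(input_shape)


def reduce_dims_shape(
    input_shape: Sequence[int], dims: Iterable[int], keepdim: bool = False
) -> Tuple[int, ...]:
    """Output shape after reducing over dims (hypothetical helper)."""
    rank = len(input_shape)
    reduced = {d % rank for d in dims}  # normalize negative dims
    if keepdim:
        # Reduced dimensions stay in place with size 1.
        return tuple(1 if i in reduced else s for i, s in enumerate(input_shape))
    # Reduced dimensions are dropped entirely.
    return tuple(s for i, s in enumerate(input_shape) if i not in reduced)
```

Because both helpers work on shapes alone, the same logic could serve a PyTorch meta function and a TensorRT shape descriptor without touching tensor data.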

manual_cuda_kernel_plugin is the lower-level alternative for kernels outside the declarative DSL — shape-changing outputs, multi-output kernels, or non-standard
launch geometries. The caller provides eager_fn and aot_fn directly; the decorator still registers the PyTorch op, TRT plugin, AOT implementation, and converter
in a single call.

A Custom(fn=...) geometry is also available for callers who want the declarative path's schema/meta derivation but need to hand-write the TRT KernelLaunchParams.

Type of change

  • New feature (non-breaking change which adds functionality)

Checklist:

  • My code follows the style guidelines of this project (You can use the linters)
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas and hacks
  • I have made corresponding changes to the documentation
  • I have added tests to verify my fix or my feature
  • New and existing unit tests pass locally with my changes
  • I have added the relevant labels to my PR so that relevant reviewers are notified

@meta-cla meta-cla Bot added the cla signed label Apr 21, 2026
@github-actions github-actions Bot added the "component: tests" and "component: api [Python]" labels Apr 21, 2026
@github-actions github-actions Bot requested a review from lanluo-nvidia April 21, 2026 16:55
github-actions[bot]

This comment was marked as outdated.

@bowang007 bowang007 marked this pull request as draft April 21, 2026 16:56
@github-actions github-actions Bot added the "component: build system" label Apr 22, 2026
@github-actions github-actions Bot added the "documentation" label Apr 22, 2026
@bowang007 bowang007 requested a review from narendasan April 22, 2026 18:06
@bowang007 bowang007 marked this pull request as ready for review April 22, 2026 18:09
Comment thread examples/dynamo/custom_cuda_kernel_op.py
Comment thread examples/dynamo/manual_cuda_kernel_plugin_annotation.py Outdated

@narendasan narendasan left a comment

Cool, I think this is getting really close. We just have a few naming things to sort out to make this more user friendly, and I think we should let users provide PTX directly in addition to the CUDA APIs. Also, did you add nvrtc as an optional dependency in the pyproject.toml (maybe under an extra called kernels)?
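
One way the optional-dependency suggestion could look on the Python side is a small import guard that fails with an actionable message when the NVRTC bindings are missing. This is only a sketch: the helper name require_optional, the module name passed to it, and the `kernels` extra (taken from the suggestion above) are assumptions, not what the PR implements.

```python
import importlib
import importlib.util
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency or raise an error naming the extra."""
    try:
        spec = importlib.util.find_spec(module_name)
    except ModuleNotFoundError:
        # Raised for dotted names whose parent package is not installed.
        spec = None
    if spec is None:
        raise ImportError(
            f"'{module_name}' is required for this feature; "
            f"install it with: pip install 'torch_tensorrt[{extra}]'"
        )
    return importlib.import_module(module_name)
```

Calling the guard at kernel-registration time rather than at package import keeps the base install usable for users who never compile CUDA source.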

Comment thread docsrc/py_api/annotation.rst Outdated
Comment thread examples/dynamo/auto_cuda_kernel_plugin_annotation.py Outdated
Comment thread examples/dynamo/auto_cuda_kernel_plugin_annotation.py Outdated
Comment thread examples/dynamo/manual_cuda_kernel_plugin_annotation.py Outdated
Comment thread py/torch_tensorrt/annotation/_custom_plugin/_descriptor.py Outdated
Comment thread py/torch_tensorrt/kernels/_custom_plugin/__init__.py
# Numel("x") pass x.numel() to the kernel as an int extra.
# Elementwise(flat) 1-D launch over the flattened output; any input rank works.

tta.auto_cuda_kernel_plugin(

maybe we can call this something like torch_tensorrt.kernels.cuda_kernel_op

Comment thread examples/dynamo/manual_cuda_kernel_plugin_annotation.py Outdated
Comment thread py/torch_tensorrt/kernels/_custom_plugin/__init__.py Outdated
@bowang007 bowang007 changed the title from "feat: TorchTRT Annotation Layer for Cuda generated kernels" to "feat: TorchTRT Cuda generated kernels plugin support" May 11, 2026
@github-actions github-actions Bot added the "component: conversion", "component: core", and "component: dynamo" labels May 11, 2026

@github-actions github-actions Bot left a comment

There are some changes that do not conform to Python style guidelines:

--- /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/conversion/plugins/_generate_plugin_converter.py	2026-05-11 16:39:32.423025+00:00
+++ /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/conversion/plugins/_generate_plugin_converter.py	2026-05-11 16:39:55.856400+00:00
@@ -36,11 +36,13 @@


def _coerce_plugin_attr_for_qdp(value: Any, attr_annotation: Any) -> Any:
    """Convert Python scalars to the serialized type expected by QDP."""
    if _is_numpy_attr_annotation(attr_annotation):
-        return np.asarray(_unwrap_scalar_attr(value), dtype=_numpy_attr_dtype(attr_annotation))
+        return np.asarray(
+            _unwrap_scalar_attr(value), dtype=_numpy_attr_dtype(attr_annotation)
+        )
    return value


def _is_numpy_attr_annotation(annotation: Any) -> bool:
    return annotation is np.ndarray or typing.get_origin(annotation) is np.ndarray
--- /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/kernels/_kernel_plugin.py	2026-05-11 16:39:32.436980+00:00
+++ /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/kernels/_kernel_plugin.py	2026-05-11 16:39:57.457501+00:00
@@ -371,16 +371,14 @@
            annotations[d.name] = d.py_type
        else:
            sig_pieces.append(f"{d.name}: 'torch.Tensor'")
            annotations[d.name] = torch.Tensor
    sig_src = ", ".join(sig_pieces)
-    body = textwrap.dedent(
-        f"""
+    body = textwrap.dedent(f"""
        def _wrapper({sig_src}) -> 'torch.Tensor':
            return _fn({", ".join(param_names)})
-        """
-    )
+        """)
    ns: Dict[str, Any] = {"_fn": fn, "torch": torch}
    exec(compile(body, "<cuda_kernel_op>", "exec"), ns)
    wrapper: Callable[..., Any] = ns["_wrapper"]
    wrapper.__annotations__ = dict(annotations)
    wrapper.__annotations__["return"] = torch.Tensor
@@ -713,8 +711,6 @@
        precompiled_ptx=ptx,
        use_aot_if_available=not any(
            isinstance(input_spec, ScalarInput) for input_spec in spec.inputs
        ),
    )
-    _LOGGER.info(
-        "cuda_kernel_op '%s' registered (schema: %s)", op_name, schema
-    )
+    _LOGGER.info("cuda_kernel_op '%s' registered (schema: %s)", op_name, schema)
