Benchmark Result
Per-clip minADE (m)

| # | Clip | ONNX-TRT | Torch-TRT | Δ |
|---|---|---|---|---|
| 1 | 0043b781… | 0.1719 | 0.1721 | +0.0002 |
| 2 | 00ee8960… | 0.9539 | 0.9582 | +0.0043 |
| 3 | 0145f6e0… | 0.3925 | 0.3929 | +0.0004 |
| 4 | 01460b45… | 0.0876 | 0.0874 | −0.0002 |
| 5 | 01cf8186… | 0.8914 | 0.7777 | −0.1137 |
| 6 | 021a4585… | 0.0976 | 0.1003 | +0.0027 |
| 7 | 0347d9f9… | 1.5423 | 1.5467 | +0.0044 |
| 8 | 03aa0b51… | 1.3659 | 1.3488 | −0.0171 |
| 9 | 0455a6e4… | 0.2546 | 0.2537 | −0.0009 |
| 10 | 047c0263… | 0.5622 | 0.5603 | −0.0019 |
| 11 | 049445b3… | 0.4437 | 0.4453 | +0.0016 |
| | Average | 0.6149 | 0.6039 | −0.0110 |

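
The Average row and the Δ column can be re-derived from the per-clip values. The following is a minimal Python sketch with the numbers copied from the table above. The variable names, the truncated clip-ID prefixes used as dictionary keys, and the 1 cm outlier threshold are illustrative assumptions, not part of the benchmark harness.

```python
# Per-clip minADE in metres, copied from the table: (ONNX-TRT, Torch-TRT).
# Keys are the truncated clip-ID prefixes exactly as shown in the table.
MIN_ADE = {
    "0043b781": (0.1719, 0.1721),
    "00ee8960": (0.9539, 0.9582),
    "0145f6e0": (0.3925, 0.3929),
    "01460b45": (0.0876, 0.0874),
    "01cf8186": (0.8914, 0.7777),
    "021a4585": (0.0976, 0.1003),
    "0347d9f9": (1.5423, 1.5467),
    "03aa0b51": (1.3659, 1.3488),
    "0455a6e4": (0.2546, 0.2537),
    "047c0263": (0.5622, 0.5603),
    "049445b3": (0.4437, 0.4453),
}


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


onnx_avg = mean(o for o, _ in MIN_ADE.values())
torch_avg = mean(t for _, t in MIN_ADE.values())

# Clips whose absolute difference exceeds 1 cm (threshold chosen for illustration).
outliers = {
    clip: round(t - o, 4)
    for clip, (o, t) in MIN_ADE.items()
    if abs(t - o) > 0.01
}

print(f"ONNX-TRT avg:  {onnx_avg:.4f}")
print(f"Torch-TRT avg: {torch_avg:.4f}")
print(f"delta of avgs: {torch_avg - onnx_avg:+.4f}")
print(f"clips over 1 cm: {outliers}")
```

On these inputs the unrounded difference of the averages comes out at about −0.0109 m; the table's −0.0110 is the difference of the two already-rounded averages.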
Stage timings (mean of clips 2–11, ms; clip 1 excluded as warm-up)

| Stage | ONNX-TRT | Torch-TRT | Δ | Δ% |
|---|---|---|---|---|
| ViT | 364.0 | 365.4 | +1.4 | +0.4% |
| LLM Prefill | 890.7 | 875.0 | −15.7 | −1.8% |
| LLM Generation | 146.5 | 188.7 | +42.2 | +28.8% |
| Diffusor | 169.9 | 173.5 | +3.6 | +2.1% |
| E2E | 1571.9 | 1602.6 | +30.7 | +2.0% |

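
For reference, the Δ and Δ% columns follow from the two mean columns, taking ONNX-TRT as the baseline. The following is a small Python sketch of that arithmetic; the helper name `delta_row` is hypothetical, and only the mean values published in the table are used (no per-clip timings are assumed).

```python
# Mean stage timings in ms (clips 2-11), copied from the table:
# (ONNX-TRT, Torch-TRT).
STAGE_MS = {
    "ViT": (364.0, 365.4),
    "LLM Prefill": (890.7, 875.0),
    "LLM Generation": (146.5, 188.7),
    "Diffusor": (169.9, 173.5),
    "E2E": (1571.9, 1602.6),
}


def delta_row(onnx_ms, torch_ms):
    """Return (absolute delta in ms, delta as % of the ONNX-TRT baseline)."""
    delta = torch_ms - onnx_ms
    return delta, 100.0 * delta / onnx_ms


rows = {stage: delta_row(*pair) for stage, pair in STAGE_MS.items()}

for stage, (d_ms, d_pct) in rows.items():
    print(f"{stage:<15} {d_ms:+7.1f} ms  {d_pct:+6.1f}%")
```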
Engine sizes

| Engine | ONNX-TRT (MiB) | Torch-TRT (MiB) | Δ |
|---|---|---|---|
| LLM | 14484 | 14495 | +11 |
| Visual | 1106 | 1114 | +8 |
| Action | 4357 | 4380 | +23 |

Shared execution context (per-runner workspace, bytes)

| Runner | ONNX-TRT | Torch-TRT | Ratio |
|---|---|---|---|
| LLM | 2,776,893,952 | 2,776,893,952 | 1.0× |
| Vision | 4,074,504,192 | 4,074,505,216 | 1.0× |
| Action | 265,029,632 | 2,025,095,168 | ~7.6× |

The peak shared execution context is set by the vision runner (~4.07 GB) in both cases, so peak GPU memory for the shared workspace is essentially unchanged (the two vision figures differ by 1 KiB). Only the action runner's reserved workspace grows under Torch-TRT, by about 1.76 GB.
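
The reasoning here rests on the runners sharing one execution-context allocation sized to the largest per-runner requirement, so the peak is the maximum across runners rather than the sum. The following is a minimal Python sketch of that arithmetic with the byte counts copied from the table; the max-not-sum model and the helper name `shared_peak` are assumptions made for illustration.

```python
# Per-runner workspace in bytes, copied from the table: (ONNX-TRT, Torch-TRT).
WORKSPACE_BYTES = {
    "LLM": (2_776_893_952, 2_776_893_952),
    "Vision": (4_074_504_192, 4_074_505_216),
    "Action": (265_029_632, 2_025_095_168),
}


def shared_peak(column):
    """Peak of a shared allocation: the largest single requirement, not the sum.

    column 0 is ONNX-TRT, column 1 is Torch-TRT.
    """
    return max(sizes[column] for sizes in WORKSPACE_BYTES.values())


onnx_peak = shared_peak(0)
torch_peak = shared_peak(1)
action_onnx, action_torch = WORKSPACE_BYTES["Action"]
action_ratio = action_torch / action_onnx

print(f"ONNX-TRT peak:   {onnx_peak / 1e9:.2f} GB")
print(f"Torch-TRT peak:  {torch_peak / 1e9:.2f} GB")
print(f"peak difference: {torch_peak - onnx_peak} bytes")
print(f"action ratio:    {action_ratio:.1f}x")
```

Under this model the action runner's larger workspace only affects the peak if it ever exceeds the vision runner's requirement, which it does not in either column.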
Verdict
- Accuracy: equivalent on average (Torch-TRT is about 1 cm better on this 11-clip subset). Per-clip differences are under 1 cm except for clip 8, where Torch-TRT is 1.7 cm closer, and clip 5, where Torch-TRT happens to be 11 cm closer.
- Throughput: Torch-TRT is ~2% slower end-to-end. The regression is concentrated in LLM decode (+29%); prefill is actually marginally faster.
- Memory: action-runner workspace ~7.6× larger under Torch-TRT (peak GPU memory unchanged because vision dominates).