Security audit documenting 221 silent int64-to-int32 truncation sites in vLLM's CUDA/C++ extensions that enable GPU buffer overflow via crafted GGUF model files.
This repository contains the full research report documenting a systemic pattern of silent integer truncation across the C++ and CUDA extensions of vLLM, the most widely deployed open-source inference engine for large language model serving.
Key findings:
- 221 instances where PyTorch's 64-bit tensor metadata (`int64_t`) is silently narrowed to 32-bit `int` variables in vLLM's `csrc/` directory
- Truncated values are used in GPU buffer allocations, kernel launch parameters, and loop bounds
- For GGUF model file code paths, tensor dimensions are directly attacker-controlled through the model file header
- This is the same vulnerability class that has produced 10 CVEs in llama.cpp and Ollama
- A proof-of-concept attack chain demonstrates how a crafted GGUF file triggers a deterministic GPU buffer overflow
The report covers:
- A formal definition of the weakness class (silent narrowing of attacker-controlled tensor metadata)
- The complete 221-site audit with breakdowns by file and API call
- A proof-of-concept attack chain from crafted GGUF file to GPU buffer overflow
- Comparative CVE evidence from llama.cpp and Ollama
- An argument for formal CWE classification of model-file-sourced memory corruption
- Recommended fixes for vLLM and the broader ML infrastructure ecosystem
The truncation count can be independently verified against vLLM commit 63babd1:

```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout 63babd1

# tensor.sizes()[idx] -> int (12 sites)
grep -rn 'int [a-z_]* = .*\.sizes()\[' csrc/ --include="*.cu" --include="*.cpp" \
  | grep -v int64_t | wc -l

# tensor.size(dim) -> int (159 sites)
grep -rn 'int [a-z_]* = .*\.size(' csrc/ --include="*.cu" --include="*.cpp" \
  | grep -v int64_t | grep -v constexpr | wc -l

# tensor.numel() -> int (22 sites)
grep -rn 'int [a-z_]* = .*\.numel()' csrc/ --include="*.cu" --include="*.cpp" \
  | grep -v int64_t | wc -l

# tensor.stride(dim) -> int (28 sites)
grep -rn 'int [a-z_]* = .*\.stride(' csrc/ --include="*.cu" --include="*.cpp" \
  | grep -v int64_t | wc -l

# Total: 221
```

| Date | Event |
|---|---|
| 2026-03-29 | Identified int64-to-int truncation pattern in vLLM GGUF kernels |
| 2026-03-29 | Completed systemic audit: 221 truncation sites across 20+ files |
| 2026-03-29 | Submitted GHSA-w2f2-mvhx-4xpg (memcpy_to_shm OOB) |
| 2026-03-29 | Submitted GHSA-5jv2-g5wq-cmr4 (int64-to-int truncation) |
| 2026-03-30 | Both GHSAs closed by maintainers |
| 2026-04-01 | Report shared with vLLM security lead for review |
| 2026-04-05 | Public release |
Both findings were submitted through coordinated disclosure via GitHub Security Advisory and were closed by the vLLM maintainers as not meeting their security bar. No embargo applies. The full disclosure narrative is in Section 6 of the report.
This research is part of a broader effort to document memory corruption vulnerabilities across the ML inference stack. Related CVEs in the same vulnerability class:
- llama.cpp: CVE-2025-53630, CVE-2026-27940, CVE-2025-49847, CVE-2026-33298, TALOS-2024-1913, TALOS-2024-1914, TALOS-2024-1915
- Ollama: CVE-2025-0315, CVE-2025-0317, CVE-2025-66959
A proposal for a new CWE entry (child of CWE-197) specific to integer truncation of tensor metadata in ML inference engines has been submitted to MITRE using this research as supporting evidence.
If you reference this work in your research, please cite:

```bibtex
@techreport{srivastava2026vllm,
  title  = {221 Silent Integer Truncations in vLLM: How a Single Malicious
            Model File Can Corrupt GPU Memory in the Most Widely Deployed
            LLM Inference Engine},
  author = {Srivastava, Aviral},
  year   = {2026},
  month  = {April},
  url    = {https://github.com/YOUR_USERNAME/vllm-integer-truncation-audit},
  note   = {Independent security research report}
}
```

Aviral Srivastava -- Independent Security Researcher
- CVE-2026-33017 (Langflow RCE, Critical 9.3, CISA KEV)
- CVE-2026-32628 (AnythingLLM SQL Injection)
- RSA Security Scholar 2025
- NIST OLIR contributor, MITRE CWE/ATLAS contributor
Contact: LinkedIn
This work is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).
You are free to share and adapt this material for any purpose, including commercial, as long as appropriate credit is given.