Tech review of the vllm quantization LP #3307

Open
pareenaverma wants to merge 3 commits into ArmDeveloperEcosystem:main from pareenaverma:content_review

Conversation

@pareenaverma
Contributor

Before submitting a pull request for a new Learning Path, please review Create a Learning Path

  • [x] I have reviewed Create a Learning Path

Please do not include any confidential information in your contribution. This includes confidential microarchitecture details and unannounced product information.

  • [x] I have checked my contribution for confidential information

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the Creative Commons Attribution 4.0 International License.

| - stem | 2|none | 0|acc |↑ |0.6053|± |0.0345|
```

The INT8 model scores 0.6614 on MMLU compared to 0.6895 for BF16 — a drop of approximately 3%, which is consistent with the expected accuracy cost of INT8 weight quantization. For full reference results, see the [Red Hat model card](https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8#accuracy).
Contributor

What is the reference point for claiming that a 3% drop is expected with INT8? I think it's slightly wrong to say this, especially with --limit 10 benchmarking.

Contributor Author

Fair point. We can remove the 3% drop claim but still keep the output reference so the reader knows what to expect from running it. I'd use the same phrasing as in my reply to the previous comment.

```

We expect INT8 inference to show a slight accuracy drop compared to BF16. For reference results and expected accuracy differences, see the Red Hat model card: https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8#accuracy
The output is similar to:
Contributor

@nikhil-arm May 19, 2026

IMO it's better not to post accuracy numbers unless we have run full benchmarking on certain tasks, because these numbers would change heavily depending on which prompts get randomly picked for benchmarking.

Contributor Author

Happy to add some phrasing that addresses that concern, such as: "The output above shows the format you can expect from lm_eval. At --limit 10, only 10 prompts are randomly selected per task, so the specific values will vary significantly between runs and are not a reliable basis for comparison. For published full-dataset accuracy figures, see the Red Hat model card."
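
To make the concern in this thread concrete, here is a minimal sketch of the arithmetic behind it. It is an illustration rather than anything lm_eval itself runs: it assumes a simple binomial model in which each prompt is scored independently as right or wrong, the sample sizes are hypothetical, and the two scores are the ones quoted in the passage under review.

```python
import math


def binomial_stderr(p: float, n: int) -> float:
    """Standard error of an accuracy p estimated from n independent prompts."""
    return math.sqrt(p * (1.0 - p) / n)


# MMLU scores quoted in the passage under review.
bf16_acc = 0.6895
int8_acc = 0.6614

abs_drop = bf16_acc - int8_acc   # gap in accuracy, as a fraction of all prompts
rel_drop = abs_drop / bf16_acc   # gap relative to the BF16 score
recovery = int8_acc / bf16_acc   # INT8/BF16 ratio ("accuracy recovery")

print(f"absolute drop: {abs_drop:.4f}")
print(f"relative drop: {rel_drop:.2%}")
print(f"recovery:      {recovery:.2%}")

# Hypothetical numbers of scored prompts (assumption, for illustration only).
sample_sizes = (10, 100, 1000)
stderr_by_n = {n: binomial_stderr(int8_acc, n) for n in sample_sizes}

for n, se in stderr_by_n.items():
    print(f"n={n:>5}: stderr ~ {se:.3f}")
```

Under this simplified model, the standard error with 10 prompts is several times larger than the gap between the two scores, which is why a difference of that size cannot be read off a small sample. A benchmark that aggregates several subtasks scores more prompts in total, so its noise is lower than the single-task figure here; the `±` column that lm_eval prints is the value to rely on for a real run.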

lm_eval --model vllm --model_args pretrained=meta-llama/Llama-3.1-8B,dtype=bfloat16,max_model_len=4096 --tasks mmlu,gsm8k --batch_size auto
```

The output is similar to:
Contributor

@nikhil-arm May 19, 2026

IMO it's better not to post accuracy numbers unless we have run full benchmarking on certain tasks, because these numbers would change heavily depending on which prompts get randomly picked for benchmarking.
I do understand the point of showcasing a representative output table, though.

### Accuracy recovery: INT8/BF16 (--limit 10)
| MMLU | GSM8k |
|---|---|
| 97% | 92% |
Contributor

I generated these values without any limit applied. If you generated these yourself with that arg then no issue here.

3 participants