Conversation

@vkuzo vkuzo commented Dec 5, 2025

Summary:

Creates a standalone eval script for generating accuracy metrics for the
quantization README.md, based on the HuggingFace model definition of
Llama 3.1 8B.

Why new script?

  1. The current prod script in
    https://github.com/pytorch/ao/blob/main/torchao/_models/llama/eval.py
    uses a custom model definition. It predates the HF integration, so
    it's better to use HF's model definition now.
  2. We have HummingBird scripts in
    https://github.com/pytorch/ao/tree/40c4f44677ae11166c3dcfbb9189cfa78789390c/.github/scripts/torchao_model_releases,
    but they seem pretty verbose and hard to use or modify.
  3. We have
    https://github.com/pytorch/ao/blob/main/benchmarks/_models/eval_hf_models.py,
    which I copy-pasted and modified for the current PR. That script
    didn't work as-is for various reasons and also seemed hard to use or
    modify; for the main README.md it's important to have a very simple
    standalone script.

We should probably do a pass on the naming before landing.
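The script's end product is a table of accuracy numbers for the README.md. A minimal sketch of how collected metrics could be rendered into a markdown table is below; the recipe names, metric keys, and the `format_readme_table` helper are all hypothetical, not the actual output format of this PR's script.

```python
# Hypothetical sketch: turn collected accuracy metrics into a markdown
# table for the quantization README.md. Recipe names and metric keys
# are illustrative only.

def format_readme_table(results: dict[str, dict[str, float]]) -> str:
    """Render {recipe: {task: accuracy}} as a markdown table.

    The baseline row (if present under the key "bfloat16") is listed
    first so quantized recipes can be eyeballed against it.
    """
    tasks = sorted({t for metrics in results.values() for t in metrics})
    header = "| recipe | " + " | ".join(tasks) + " |"
    divider = "|" + "---|" * (len(tasks) + 1)
    rows = [header, divider]
    # sort key: False < True, so "bfloat16" comes before everything else
    ordered = sorted(results, key=lambda r: (r != "bfloat16", r))
    for recipe in ordered:
        cells = [f"{results[recipe].get(t, float('nan')):.4f}" for t in tasks]
        rows.append(f"| {recipe} | " + " | ".join(cells) + " |")
    return "\n".join(rows)
```

With that in place, each eval run only needs to append its numbers to a dict and re-render the table, rather than hand-editing the README.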

Future work:

  1. Add metrics for int4_weight_only_hqq (needs to run on an A100).
  2. Add metrics for 'int4 weight float8 activation' (currently doesn't work with HF accelerate).
  3. Add metrics for mxfp8 and nvfp4 (needs to run on a B200).
  4. Automate the parsing of logs.
  5. Add a similar script for performance benchmarks, using vLLM.
  6. Delete https://github.com/pytorch/ao/blob/main/torchao/_models/llama/
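For the "automate the parsing of logs" item, one stdlib-only approach is a small regex pass over the raw eval output. The log line format assumed here (`<task> <metric>: <value>`) is a guess at an lm-eval-style result line, not what the script actually emits.

```python
# Sketch of automated log parsing for eval results. The line format
# "<task> <metric>: <value>" is a hypothetical assumption about the
# eval script's output, not its confirmed format.
import re

_RESULT_RE = re.compile(r"^(?P<task>\S+)\s+(?P<metric>\S+):\s+(?P<value>[0-9.]+)$")

def parse_eval_log(text: str) -> dict[tuple[str, str], float]:
    """Extract {(task, metric): value} pairs from raw eval log text.

    Non-matching lines (progress messages, warnings, etc.) are skipped.
    """
    results = {}
    for line in text.splitlines():
        m = _RESULT_RE.match(line.strip())
        if m:
            results[(m["task"], m["metric"])] = float(m["value"])
    return results
```

A parser like this would let the README table be regenerated directly from saved log files instead of by manual copy-paste.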

Test Plan:

```
// debug run on small model
with-proxy time ./benchmarks/quantization/eval_accuracy_for_readme.sh facebook/opt-125m

// real run
with-proxy time ./benchmarks/quantization/eval_accuracy_for_readme.sh
```

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]

pytorch-bot bot commented Dec 5, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3449

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 1 New Failure

As of commit 002ba19 with merge base 69ce0fd:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

vkuzo added a commit that referenced this pull request Dec 5, 2025
ghstack-source-id: 39c1d72
ghstack-comment-id: 3618394399
Pull-Request: #3449
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 5, 2025