[ET-VK][Ops] linear_qta8a_qga4w_qta8o test framework #12005

ahmtox · 2025-06-26T16:56:21Z

Stack from ghstack (oldest at bottom):

Context

This test framework establishes the foundation for validating the linear_qta8a_qga4w_qta8o operator implementation as part of enabling dynamic quantization. The goal is to move beyond weight-only quantization to linear operations in which both activations and weights are quantized, so that the matrix multiplication can run in integer arithmetic end to end for improved performance on GPU hardware.

The current weight-only quantized linear implementations in ET-VK dequantize weights to floating point before computation, missing the performance benefits of integer arithmetic.

The operator name breaks down as follows:

qta8a: Quantized per-token affine 8-bit activation inputs
qga4w: Quantized per-group affine 4-bit weights
qta8o: Quantized per-token affine 8-bit outputs
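To make "per-token affine 8-bit" concrete, here is a minimal sketch in plain C++ of quantizing one token (one row of activations) to int8. It is not code from this PR: the names TokenQParams, choose_token_qparams, quantize_value, and dequantize_value are hypothetical, and the min/max scheme used to pick the scale and zero point is an assumption, since the description does not say how those parameters are chosen.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical container for the affine parameters of a single token (row).
struct TokenQParams {
  float scale;
  int32_t zero_point;
};

// Assumption: min/max calibration over the token, with the range widened to
// include 0 so that zero is exactly representable. The PR does not specify this.
TokenQParams choose_token_qparams(const std::vector<float>& token) {
  assert(!token.empty());
  float lo = std::min(0.0f, *std::min_element(token.begin(), token.end()));
  float hi = std::max(0.0f, *std::max_element(token.begin(), token.end()));
  float scale = (hi - lo) / 255.0f;  // 255 steps between -128 and 127
  if (scale == 0.0f) {
    scale = 1.0f;  // all-zero token: any positive scale works
  }
  int32_t zp = static_cast<int32_t>(std::nearbyint(-128.0f - lo / scale));
  zp = std::max<int32_t>(-128, std::min<int32_t>(127, zp));
  return {scale, zp};
}

// Affine quantization: q = round(x / scale) + zero_point, clamped to int8.
int8_t quantize_value(float x, TokenQParams p) {
  int32_t q = static_cast<int32_t>(std::nearbyint(x / p.scale)) + p.zero_point;
  q = std::max<int32_t>(-128, std::min<int32_t>(127, q));
  return static_cast<int8_t>(q);
}

// Inverse affine transformation: x = (q - zero_point) * scale.
float dequantize_value(int8_t q, TokenQParams p) {
  return (static_cast<float>(q) - static_cast<float>(p.zero_point)) * p.scale;
}
```

Each token gets its own scale and zero point, which is what distinguishes per-token quantization from a single per-tensor pair.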

Changes

The reference implementation (linear_qta8a_qga4w_qta8o_4bit_dequant_impl) provides a baseline for validating the GPU shader implementation through a deliberately simplified computation path. The quantized int8 input tensor is dequantized using the standard affine transformation (quantized_input.to(at::kFloat) - input_zero_point) * input_scale. After dequantization, the implementation performs a standard floating-point linear operation, at::linear(x_float, weights_dequantized), then manually quantizes the result using at::round(linear_result / output_scale) + output_zero_point, clamped to the int8 range [-128, 127]. This three-stage approach of dequantize → compute → quantize provides a clear reference against which the GPU's integer arithmetic implementation can be validated.
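The dequantize → compute → quantize path can be sketched in plain C++ without ATen. This is an illustrative stand-in, not the PR's linear_qta8a_qga4w_qta8o_4bit_dequant_impl: the function name reference_linear_q8 is hypothetical, the per-token scale and zero-point vectors are an assumed layout, and the weights are taken as already dequantized floats because the description does not detail the 4-bit weight unpacking. The weight matrix is laid out as [N x K], matching how at::linear multiplies the input by the transposed weight.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference: int8 input [M x K], float weights [N x K],
// int8 output [M x N], all row-major. One scale/zero point per token (row).
std::vector<int8_t> reference_linear_q8(
    const std::vector<int8_t>& q_input,
    const std::vector<float>& weights_dequantized,
    int M, int K, int N,
    const std::vector<float>& input_scale,
    const std::vector<int32_t>& input_zero_point,
    const std::vector<float>& output_scale,
    const std::vector<int32_t>& output_zero_point) {
  assert(q_input.size() == static_cast<std::size_t>(M) * K);
  assert(weights_dequantized.size() == static_cast<std::size_t>(N) * K);
  assert(input_scale.size() == static_cast<std::size_t>(M));
  assert(input_zero_point.size() == static_cast<std::size_t>(M));
  assert(output_scale.size() == static_cast<std::size_t>(M));
  assert(output_zero_point.size() == static_cast<std::size_t>(M));

  std::vector<int8_t> out(static_cast<std::size_t>(M) * N);
  std::vector<float> x(K);
  for (int m = 0; m < M; ++m) {
    // Stage 1: dequantize the token, x = (q - zero_point) * scale.
    for (int k = 0; k < K; ++k) {
      x[k] = (static_cast<float>(q_input[m * K + k]) -
              static_cast<float>(input_zero_point[m])) * input_scale[m];
    }
    for (int n = 0; n < N; ++n) {
      // Stage 2: floating-point dot product with weight row n (x * W^T).
      float acc = 0.0f;
      for (int k = 0; k < K; ++k) {
        acc += x[k] * weights_dequantized[n * K + k];
      }
      // Stage 3: requantize. std::nearbyint rounds half to even in the
      // default rounding mode, then the result is clamped to [-128, 127].
      int32_t q = static_cast<int32_t>(std::nearbyint(acc / output_scale[m])) +
                  output_zero_point[m];
      q = std::max<int32_t>(-128, std::min<int32_t>(127, q));
      out[m * N + n] = static_cast<int8_t>(q);
    }
  }
  return out;
}
```

A test harness built on this idea would run the same quantized inputs through the GPU operator and compare its int8 output against this result, typically allowing a small tolerance for rounding differences between the integer and floating-point paths.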

Differential Revision: D77173442

pytorch-bot · 2025-06-26T16:56:25Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/12005

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

VolumeLimitExceeded Issue for linux.2xlarge and linux.4xlarge

✅ No Failures

As of commit 2e1cbf0 with merge base 85cf6ce:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot · 2025-06-26T16:56:31Z

This pull request was exported from Phabricator. Differential Revision: D77173442

ahmtox requested a review from SS-JIA as a code owner June 26, 2025 16:56

ahmtox mentioned this pull request Jun 26, 2025

[ET-VK][Ops] linear_qta8a_qga4w_qta8o impl and shaders #12006

Open

facebook-github-bot added the CLA Signed label (This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.) Jun 26, 2025

facebook-github-bot added the fb-exported label Jun 26, 2025

ahmtox mentioned this pull request Jun 26, 2025

[ET-VK] benchmarking linear_qta8a_qga4w_qta8o #12007

Closed

ahmtox added the release notes: vulkan label (Changes to the Vulkan backend delegate) Jun 26, 2025
