
Add support for bitnets to ORT WebGPU EP #25587


Open
wants to merge 4 commits into base: main
Conversation

@sushraja-msft (Contributor) commented Jul 30, 2025

Description

This change introduces support for BitNet models (specifically microsoft/bitnet-b1.58-2B-4T-bf16) by adding a new operator, BitLinear, which handles the unique weight format BitNets use for matrix multiplication.

A converted ONNX model for testing is available at https://huggingface.co/sushraja/bitnet-b1.58-2B-4T-fp16-onnx.

Motivation and Context

BitNets significantly reduce memory usage thanks to their compact parameter representation, making them well suited for client-side inference. For example, at 5 ternary weights per byte, the weights of a 2B-parameter model take roughly 0.4 GB, versus about 4 GB in fp16.

BitLinear Operator

BitNets encode matrix weights as ternary values (+1, 0, -1) along with a scale factor. The ternary values are treated as base-3 digits and packed so that 5 weights fit into a single uint8; since 3^5 = 243 ≤ 256, five ternary digits always fit in one byte.
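As a rough illustration, here is a minimal host-side sketch of this packing scheme in C++ (the helper names and digit order are illustrative assumptions, not the PR's actual code):

#include <array>
#include <cstdint>

// Pack five ternary weights (-1, 0, +1) into one byte as a base-3 number.
// Since 3^5 = 243 <= 256, the result always fits in a uint8_t.
uint8_t PackFiveTernary(const std::array<int8_t, 5>& w) {
  uint8_t packed = 0;
  for (int i = 0; i < 5; ++i) {
    packed = packed * 3 + static_cast<uint8_t>(w[i] + 1);  // map {-1,0,+1} -> {0,1,2}
  }
  return packed;
}

// Unpack one byte back into five ternary weights.
std::array<int8_t, 5> UnpackFiveTernary(uint8_t packed) {
  std::array<int8_t, 5> w{};
  for (int i = 4; i >= 0; --i) {
    w[i] = static_cast<int8_t>(packed % 3) - 1;  // map {0,1,2} -> {-1,0,+1}
    packed /= 3;
  }
  return w;
}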

Inference Workflow

The inference process involves two main steps:

Step 1: Quantization of Input A

  • Input tensor A (in fp16) is quantized to int8 with a single scale per token.
  • Four int8 values are packed into a u32.
  • Every 5th value of A is extracted and stored in a separate tensor (A5) to align with the BitNet weight packing.
  • Result: for every 20 values of A, you get:
    • a vec4 of u32s (4 × 4 packed int8 values)
    • one u32 holding the four 5th values
      (a host-side sketch of this layout follows this list)
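A host-side sketch of this quantization and layout, assuming a per-token absmax scale and that "every 5th value" means indices 4, 9, 14, ... (both assumptions; the kernel does this in WGSL on fp16 input, while float is used here for simplicity):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Quantize one token (one row of A, length K, K a multiple of 20) to int8
// with a single scale, pack 4 int8 values per u32, and route every 5th
// value into a separate stream to mirror the 5-per-byte weight packing.
void QuantizeToken(const std::vector<float>& a,
                   std::vector<uint32_t>& packed_main,   // 16 of every 20 values
                   std::vector<uint32_t>& packed_fifth,  // every 5th value (A5)
                   float& scale) {
  float absmax = 0.0f;
  for (float v : a) absmax = std::max(absmax, std::fabs(v));
  scale = (absmax == 0.0f) ? 1.0f : absmax / 127.0f;  // one scale per token

  std::vector<int8_t> main_vals, fifth_vals;
  for (size_t i = 0; i < a.size(); ++i) {
    int8_t q = static_cast<int8_t>(std::lround(a[i] / scale));
    (i % 5 == 4 ? fifth_vals : main_vals).push_back(q);
  }
  auto pack4 = [](const std::vector<int8_t>& v, std::vector<uint32_t>& out) {
    for (size_t i = 0; i < v.size(); i += 4) {
      uint32_t u = 0;
      for (size_t j = 0; j < 4; ++j)
        u |= static_cast<uint32_t>(static_cast<uint8_t>(v[i + j])) << (8 * j);
      out.push_back(u);
    }
  };
  pack4(main_vals, packed_main);    // one vec4 (4 u32s) per 20 inputs
  pack4(fifth_vals, packed_fifth);  // one u32 per 20 inputs
}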

Step 2: Multiplication with Weights B

  • Weights B are stored transposed as a stream of uint8s, each encoding 5 ternary weights.
  • Each uint8 is unpacked using a lookup table into:
    • A u32 containing 4 packed weights
    • 1 extra weight (the 5th), which is collected across 4 uint8s into a separate u32
  • This results in:
    • A vec4 from the packed weights
    • One u32 from the extra weights
  • These are then multiplied against the packed A values using the DP4A instruction, leveraging shared memory for an efficient cooperative matmul (a scalar sketch follows this list).
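A scalar C++ sketch of this decode-and-dot step (the real kernel does it in WGSL with the DP4A intrinsic and workgroup shared memory; the table layout and names below are assumptions):

#include <array>
#include <cstdint>

// Lookup table: for each byte value 0..242, 'packed4' holds the first four
// ternary weights as int8 lanes of a u32 and 'fifth' holds the remaining
// weight. (Illustrative layout; the PR's actual table may differ.)
struct LutEntry {
  uint32_t packed4;
  int8_t fifth;
};

std::array<LutEntry, 243> BuildLut() {
  std::array<LutEntry, 243> lut{};
  for (uint32_t b = 0; b < 243; ++b) {
    uint32_t v = b;
    int8_t w[5];
    for (int i = 4; i >= 0; --i) {  // decode five base-3 digits
      w[i] = static_cast<int8_t>(v % 3) - 1;
      v /= 3;
    }
    uint32_t p = 0;
    for (int i = 0; i < 4; ++i)
      p |= static_cast<uint32_t>(static_cast<uint8_t>(w[i])) << (8 * i);
    lut[b] = {p, w[4]};
  }
  return lut;
}

// Scalar stand-in for the DP4A instruction: dot product of four signed
// 8-bit lanes packed into each u32, accumulated as i32.
int32_t Dp4a(uint32_t a, uint32_t b) {
  int32_t acc = 0;
  for (int i = 0; i < 4; ++i) {
    acc += static_cast<int32_t>(static_cast<int8_t>((a >> (8 * i)) & 0xFF)) *
           static_cast<int32_t>(static_cast<int8_t>((b >> (8 * i)) & 0xFF));
  }
  return acc;
}

For every 20 elements of K, this amounts to four DP4A calls on the packed lanes plus one on the collected 5th values, with the A and B scales applied to the i32 accumulator at the end.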

Key Notes

The BitLinear operator does not enforce a specific layout for B.
Weights are stored using ternary packing (5 weights per uint8), and decompression is handled dynamically at runtime.

@sushraja-msft requested review from guschmue and qjia7 on July 30, 2025 01:51
@github-actions bot left a comment

You can commit the suggested changes from lintrunner.


// Step 1: Quantize input A using BitLinearQuantizeProgram
const uint32_t quantize_output_size = (M * (K_PADDED - (K_PADDED / kWeightsPerByte)) / 4); // skipping every 5th, packed into u32
const uint32_t quantize_5th_output_size = M * K_PADDED / kQuantizationBlockSize; // every 5th element packed into u32
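As a sanity check on these formulas (assuming kWeightsPerByte = 5 and kQuantizationBlockSize = 20, consistent with the 20-value grouping described above): for M = 1 and K_PADDED = 20, quantize_output_size = (20 - 20/5) / 4 = 4 u32s, i.e. the vec4 of packed values, and quantize_5th_output_size = 20 / 20 = 1 u32 holding the four 5th values, matching the per-20-value layout from the description.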
public:
BitLinearQuantizeProgram(uint32_t k, uint32_t k_padded) : Program{"BitLinearQuantize"}, K_(k), K_PADDED_(k_padded) {}

Status GenerateShaderCode(ShaderHelper& sh) const override;
@@ -0,0 +1,131 @@
// Copyright (c) Microsoft Corporation. All rights reserved.

Check warning (Code scanning / lintrunner): CLANGFORMAT/format. See https://clang.llvm.org/docs/ClangFormat.html. Run lintrunner -a to apply this patch.

@@ -0,0 +1,56 @@
// Copyright (c) Microsoft Corporation. All rights reserved.

Check warning (Code scanning / lintrunner): CLANGFORMAT/format. See https://clang.llvm.org/docs/ClangFormat.html. Run lintrunner -a to apply this patch.
@sushraja-msft (Contributor, Author) commented Jul 30, 2025

Going to work on tests in the next couple of days. Sharing this PR for early review of the operator shape and any feedback.

@guschmue added the ep:WebGPU (ort-web webgpu provider) label on Jul 30, 2025