I've been exploring various optimization techniques for quantization and bit-packing as part of my work on hyper-quantization (very early WIP).
One simple yet effective optimization is replacing subtraction with type reinterpretation, a zero-cost operation.
This approach is reminiscent of Quake's famous fast inverse square root (the "evil floating point bit level hacking"), where floating-point numbers are reinterpreted as integers for efficient manipulation; that trick served as a key inspiration for this technique.
Overview:
In symmetric quantization, we compute: $y = q \times \text{scale}$
where $q$ is a signed integer and $\text{scale}$ is typically an fp16 value per block. Packing small signed integers efficiently is crucial for reducing memory bandwidth and improving performance.
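For example (illustrative numbers only), with a per-block $\text{scale} = 0.05$ and a quantized value $q = -3$, dequantization gives $y = -3 \times 0.05 = -0.15$.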
The Challenge of Bit-Packing:
Packing low-bit signed integers (e.g., 4-bit values) into bytes is non-trivial because the sign bit ends up separated from the magnitude bits (`s0000xyz` for 4-bit). Traditional approaches require explicit bias subtraction, adding overhead during dequantization.
Standard Q4_0 Approach
A common method adds a bias (e.g., +8) to shift the 4-bit signed range (-8 to 7) into an unsigned range (0 to 15). Two values are packed per byte, and unpacking requires subtracting the bias back out.
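A minimal sketch of this conventional unpacking path (the function name and exact form are illustrative assumptions, not the actual llama.cpp kernel):

```cpp
#include <cstdint>

// Two biased 4-bit values per byte: each nibble is masked or shifted out,
// then the +8 bias is subtracted to recover the signed value.
inline void unpack_biased_pair(uint8_t packed, int8_t &q0, int8_t &q1) {
    q0 = static_cast<int8_t>((packed & 0x0F) - 8);  // low nibble:  AND, then subtract
    q1 = static_cast<int8_t>((packed >> 4) - 8);    // high nibble: shift, then subtract
}
// Dequantization then computes y = q * scale per value.
```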
🔴 Cost Breakdown of This Approach:

- AND mask (`& 0x0F`) – 1 operation
- Shift (`>> 4`) – 1 operation
- Subtraction (`-8`) – 2 operations

Optimized Approach: Pre-scaling + Zero-Cost Reinterpretation
Instead of storing a biased unsigned integer and subtracting the bias back during dequantization, we can pre-scale the quantized value by $2^{8-n}$ (for $n$-bit quantization) and store it in a format that allows direct reinterpretation as a signed integer.
For Q4_0, we multiply the quantized value by $2^4 = 16$ and divide the scale by the same factor. This results in a packed format where the lower $8-n$ bits are zero (`s0000xyz` $\to$ `sxyz0000` for 4-bit values). The reinterpretation (`bit_cast<int8_t>`) preserves the bit pattern without triggering sign extension.
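A corresponding sketch of the pre-scaled unpacking (again, names and layout are my own illustrative assumptions; `std::bit_cast` requires C++20):

```cpp
#include <bit>      // std::bit_cast (C++20)
#include <cstdint>

// Each nibble holds the raw 4-bit two's-complement pattern, and the per-block
// scale was divided by 16 at quantization time. Masking/shifting places the
// pattern in the high nibble (sxyz0000), and bit_cast reinterprets the byte
// as int8_t, i.e. as q * 16; no bias subtraction is needed.
inline void unpack_prescaled_pair(uint8_t packed, int8_t &q16_0, int8_t &q16_1) {
    q16_0 = std::bit_cast<int8_t>(static_cast<uint8_t>(packed & 0xF0));  // high nibble already in place
    q16_1 = std::bit_cast<int8_t>(static_cast<uint8_t>(packed << 4));    // low nibble shifted into place
}
// Dequantization: y = q16 * (scale / 16), with the division folded into the stored scale.
```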
🔴 Cost Breakdown of This Approach:

- AND mask (`& 0xF0`) – 1 operation
- Shift (`<< 4`) – 1 operation
- Type reinterpretation (`std::bit_cast<int8_t>`) – 2 operations – zero cost!
Note: The underlying $q$ value here is different, as it has been pre-multiplied by 16 and type-cast to `uint8`, instead of being offset by 8 and cast to `int8`.
CUDA Implementation
In CUDA, reinterpret_cast<int8_t*> can be used for efficient reinterpretation at the register level, eliminating any extra instructions.
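A hedged device-side sketch of what this can look like (the function name, signature, and layout are assumptions for illustration):

```cuda
#include <cstdint>

// Reinterpret the masked byte, held in a register, as a signed 8-bit value;
// the compiler emits no extra instruction for the pointer cast.
__device__ __forceinline__ float dequant_high_nibble(uint8_t packed, float scale_div16) {
    uint8_t hi  = packed & 0xF0;                      // sxyz0000, already pre-scaled by 16
    int8_t  q16 = *reinterpret_cast<int8_t*>(&hi);    // zero-cost reinterpretation
    return static_cast<float>(q16) * scale_div16;     // scale / 16 precomputed per block
}
```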
Total Cost Comparison (Original vs. Optimized)

| Approach | AND (`&`) | Shift (`>>`, `<<`) | Subtraction (`-8`) | Type Cast |
|---|---|---|---|---|
| Original (Q4_0) | ✅ | ✅ | ✅ (2x) | ❌ |
| Optimized (Bit-Trick) | ✅ | ✅ | ❌ | ✅ (zero-cost) |
Generalization to Any Bit-Width (1–7 bits)
This optimization works for any $n$-bit quantization format where $n < 8$. The general rule is:

$q_{\text{stored}} = q_{\text{original}} \times 2^{8-n}$

$\text{scale}_{\text{new}} = \frac{\text{scale}}{2^{8-n}}$

For $n = 4$ this is exactly the factor of 16 used above. This ensures that packed numbers have their lower $8-n$ bits set to zero, enabling reinterpretation instead of arithmetic correction.
Instead of using the lower $n$ bits, we use the higher $n$ bits, and the logic of the code remains the same—only the direction of the bit shifts (left ↔ right) is swapped.
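As a concrete (hypothetical) illustration of the general rule, a templated helper might look like the sketch below, assuming a single value sits in the top $n$ bits of a byte and cross-byte packing is handled elsewhere:

```cpp
#include <bit>
#include <cstdint>

// General n-bit case: the value is stored pre-scaled by 2^(8-N), so the top N
// bits of the byte carry the signed pattern and the lower 8-N bits are zero.
// The per-block scale is divided by 2^(8-N) once at quantization time.
template <int N>
inline float dequant_top_bits(uint8_t packed_byte, float scale_div) {
    static_assert(N >= 1 && N <= 7, "N must be between 1 and 7");
    constexpr uint8_t mask = static_cast<uint8_t>(0xFF << (8 - N));  // keep the top N bits
    int8_t q_scaled = std::bit_cast<int8_t>(static_cast<uint8_t>(packed_byte & mask));
    return static_cast<float>(q_scaled) * scale_div;  // (q * 2^(8-N)) * (scale / 2^(8-N)) = q * scale
}
```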
Summary
✅ Replaces subtraction with bit reinterpretation (bit_cast<int8_t>), which is zero-cost.
✅ Works for any 1–7 bit quantization format with simple pre-scaling.
✅ Eliminates unnecessary ALU operations, improving efficiency on both CPUs and CUDA GPUs.
✅ Leads to faster and more efficient dequantization, particularly for inference workloads.
Would love to hear feedback! 🚀