I've been exploring various optimization techniques for quantization and bit-packing as part of my work on hyper-quantization (very early WIP).
One simple yet effective optimization is replacing subtraction with type reinterpretation, a zero-cost operation.
This approach is reminiscent of Quake's famous fast inverse square root (the "evil floating point bit level hacking"), where floating-point numbers are reinterpreted as integers for efficient manipulation; that trick served as a key inspiration for this technique.
Overview:
In symmetric quantization, we compute: $y = q \times \text{scale}$
where $q$ is a signed integer and $\text{scale}$ is typically an fp16 value per block. Packing small signed integers efficiently is crucial for reducing memory bandwidth and improving performance.
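For example (illustrative numbers only), with a per-block $\text{scale} = 0.05$ and a quantized value $q = -3$, dequantization gives $y = -3 \times 0.05 = -0.15$.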
The Challenge of Bit-Packing:
Packing low-bit signed integers (e.g., 4-bit values) into bytes is non-trivial because the sign bit ends up separated from the magnitude bits (`s0000xyz` for 4-bit). Traditional approaches require explicit bias subtraction, adding overhead during dequantization.
Standard Q4_0 Approach
A common method adds a bias (e.g., +8) to shift the 4-bit signed range (-8 to 7) into an unsigned range (0 to 15). Two values are packed per byte, and unpacking requires subtracting the bias back out.
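A minimal sketch of this conventional unpacking path (the function name and exact form are illustrative assumptions, not the actual llama.cpp kernel):

```cpp
#include <cstdint>

// Two biased 4-bit values per byte: each nibble is masked or shifted out,
// then the +8 bias is subtracted to recover the signed value.
inline void unpack_biased_pair(uint8_t packed, int8_t &q0, int8_t &q1) {
    q0 = static_cast<int8_t>((packed & 0x0F) - 8);  // low nibble:  AND, then subtract
    q1 = static_cast<int8_t>((packed >> 4) - 8);    // high nibble: shift, then subtract
}
// Dequantization then computes y = q * scale per value.
```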
🔴 Cost Breakdown of This Approach:

- AND mask (`& 0x0F`) – 1 operation
- Shift (`>> 4`) – 1 operation
- Subtraction (`-8`) – 2 operations

Optimized Approach: Pre-scaling + Zero-Cost Reinterpretation
Instead of storing a biased unsigned integer and subtracting the bias back during dequantization, we can pre-scale the quantized value by $2^{8-n}$ (for $n$-bit quantization) and store it in a format that allows direct reinterpretation as a signed integer.
For Q4_0, we multiply the quantized value by $2^4 = 16$ and divide the scale by the same factor. This results in a packed format where the lower $8-n$ bits are zero (`s0000xyz` $\to$ `sxyz0000` for 4-bit values). The reinterpretation (`bit_cast<int8_t>`) preserves the bit pattern without triggering sign extension.
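A corresponding sketch of the pre-scaled unpacking (again, names and layout are my own illustrative assumptions; `std::bit_cast` requires C++20):

```cpp
#include <bit>      // std::bit_cast (C++20)
#include <cstdint>

// Each nibble holds the raw 4-bit two's-complement pattern, and the per-block
// scale was divided by 16 at quantization time. Masking/shifting places the
// pattern in the high nibble (sxyz0000), and bit_cast reinterprets the byte
// as int8_t, i.e. as q * 16; no bias subtraction is needed.
inline void unpack_prescaled_pair(uint8_t packed, int8_t &q16_0, int8_t &q16_1) {
    q16_0 = std::bit_cast<int8_t>(static_cast<uint8_t>(packed & 0xF0));  // high nibble already in place
    q16_1 = std::bit_cast<int8_t>(static_cast<uint8_t>(packed << 4));    // low nibble shifted into place
}
// Dequantization: y = q16 * (scale / 16), with the division folded into the stored scale.
```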
🔴 Cost Breakdown of This Approach:

- AND mask (`& 0xF0`) – 1 operation
- Shift (`<< 4`) – 1 operation
- Type reinterpretation (`std::bit_cast<int8_t>`) – 2 operations – zero cost!
Note: The underlying $q$ value here is different, as it has been pre-multiplied by 16 and type-cast to `uint8`, instead of being offset by 8 and cast to `int8`.
CUDA Implementation
In CUDA, reinterpret_cast<int8_t*> can be used for efficient reinterpretation at the register level, eliminating any extra instructions.
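A hedged device-side sketch of what this can look like (the function name, signature, and layout are assumptions for illustration):

```cuda
#include <cstdint>

// Reinterpret the masked byte, held in a register, as a signed 8-bit value;
// the compiler emits no extra instruction for the pointer cast.
__device__ __forceinline__ float dequant_high_nibble(uint8_t packed, float scale_div16) {
    uint8_t hi  = packed & 0xF0;                      // sxyz0000, already pre-scaled by 16
    int8_t  q16 = *reinterpret_cast<int8_t*>(&hi);    // zero-cost reinterpretation
    return static_cast<float>(q16) * scale_div16;     // scale / 16 precomputed per block
}
```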
Total Cost Comparison (Original vs. Optimized)

| Approach | AND (`&`) | Shift (`>>`, `<<`) | Subtraction (`-8`) | Type Cast |
|---|---|---|---|---|
| Original (Q4_0) | ✅ | ✅ | ✅ (2x) | ❌ |
| Optimized (Bit-Trick) | ✅ | ✅ | ❌ | ✅ (zero-cost) |
Generalization to Any Bit-Width (1–7 bits)
This optimization works for any $n$-bit quantization format where $n < 8$. The general rule is:

$q_{\text{stored}} = q_{\text{original}} \times 2^{8-n}$

$\text{scale}_{\text{new}} = \frac{\text{scale}}{2^{8-n}}$

For $n = 4$ this is exactly the factor of 16 used above. This ensures that packed numbers have their lower $8-n$ bits set to zero, enabling reinterpretation instead of arithmetic correction.
Instead of using the lower $n$ bits, we use the higher $n$ bits, and the logic of the code remains the same—only the direction of the bit shifts (left ↔ right) is swapped.
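As a concrete (hypothetical) illustration of the general rule, a templated helper might look like the sketch below, assuming a single value sits in the top $n$ bits of a byte and cross-byte packing is handled elsewhere:

```cpp
#include <bit>
#include <cstdint>

// General n-bit case: the value is stored pre-scaled by 2^(8-N), so the top N
// bits of the byte carry the signed pattern and the lower 8-N bits are zero.
// The per-block scale is divided by 2^(8-N) once at quantization time.
template <int N>
inline float dequant_top_bits(uint8_t packed_byte, float scale_div) {
    static_assert(N >= 1 && N <= 7, "N must be between 1 and 7");
    constexpr uint8_t mask = static_cast<uint8_t>(0xFF << (8 - N));  // keep the top N bits
    int8_t q_scaled = std::bit_cast<int8_t>(static_cast<uint8_t>(packed_byte & mask));
    return static_cast<float>(q_scaled) * scale_div;  // (q * 2^(8-N)) * (scale / 2^(8-N)) = q * scale
}
```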
Summary
✅ Replaces subtraction with bit reinterpretation (bit_cast<int8_t>), which is zero-cost.
✅ Works for any 1–7 bit quantization format with simple pre-scaling.
✅ Eliminates unnecessary ALU operations, improving efficiency on both CPUs and CUDA GPUs.
✅ Leads to faster and more efficient dequantization, particularly for inference workloads.
Would love to hear feedback! 🚀