Hi there! I've made some improvements to the AWQ quantization implementation.
Here's a summary of what I did:
1. **Resolved Tensor Dimension Mismatch (Primary Fix):**
* I modified how activation statistics are collected to compute scales per input channel (the hidden dimension) instead of a single value for the entire layer. These scales are now 1D tensors shaped like `(in_features,)`.
* I updated the quantization process to average these per-channel scales across calibration batches (when more than one is used) before applying them to each layer.
* This fixed the `RuntimeError: The size of tensor a (768) must match the size of tensor b (6) at non-singleton dimension 0` that occurred when scaling weights; the sketch after this list shows the shapes involved.
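A minimal sketch of the shapes involved, using PyTorch; the function names below (`collect_activation_scale`, `average_scales`, `apply_scales`) are illustrative stand-ins rather than the names used in this repository:

```python
import torch
import torch.nn as nn

def collect_activation_scale(x: torch.Tensor) -> torch.Tensor:
    # Mean absolute activation per input channel.
    # x has shape (..., in_features); the result is 1D with shape (in_features,).
    return x.abs().mean(dim=tuple(range(x.dim() - 1)))

def average_scales(per_batch_scales) -> torch.Tensor:
    # Average the (in_features,) scales collected from each calibration batch.
    return torch.stack(list(per_batch_scales), dim=0).mean(dim=0)

def apply_scales(layer: nn.Linear, scales: torch.Tensor) -> None:
    # weight has shape (out_features, in_features), so a (in_features,) scale
    # broadcasts over its columns: each input channel meets its own scale.
    if scales.shape != (layer.in_features,):
        raise RuntimeError(
            f"Scale shape {tuple(scales.shape)} does not match "
            f"in_features={layer.in_features} of weight {tuple(layer.weight.shape)}"
        )
    scales = scales.to(layer.weight.device, layer.weight.dtype)
    with torch.no_grad():
        layer.weight.mul_(scales)
```

With one entry per input channel, the scale broadcasts cleanly over the weight's `(out_features, in_features)` layout, which is the property the dimension fix relies on.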
2. **Enhanced Core Implementation:**
* **Memory Management:** I added explicit cleanup for large temporary tensors in the quantization and statistics-collection steps, so memory is released as soon as those tensors are no longer needed (the smoke test after this list mirrors that cleanup).
* **Device Placement:** I confirmed that tensors are consistently handled and moved to the correct device for important operations.
* **Error Handling:** I added specific error checks during weight scaling and grouping, which will now raise a `RuntimeError` with detailed shape information if something goes wrong, making it easier for you to debug.
* **Batch Processing:** I reviewed and confirmed the efficiency of how calibration data is processed in batches.
* **Verification:** I introduced a new set of checks (sketched after this list) to verify:
* Correct computation of per-channel activation scales.
* Successful application of these scales during layer quantization without dimension errors.
* The ability to perform a forward pass on a model quantized with AWQ.
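For illustration, these checks can be exercised with a small smoke test along the following lines; `quantize_fn` is a hypothetical stand-in for the project's AWQ entry point, and the explicit cleanup mirrors the memory management described above:

```python
import torch
import torch.nn as nn

def awq_smoke_test(model: nn.Module, quantize_fn, calib_batch: torch.Tensor):
    hidden = calib_batch.shape[-1]

    # 1. Per-channel activation scales come out with shape (in_features,).
    scales = calib_batch.abs().mean(dim=tuple(range(calib_batch.dim() - 1)))
    assert scales.shape == (hidden,), f"expected ({hidden},), got {tuple(scales.shape)}"

    # 2. The scales broadcast against a Linear weight without dimension errors.
    probe = nn.Linear(hidden, hidden).to(calib_batch.device)
    with torch.no_grad():
        probe.weight.mul_(scales)  # (out, in) * (in,) scales the weight columns

    # Drop the large temporary and release cached GPU memory, mirroring the
    # explicit cleanup done after statistics collection.
    del probe
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

    # 3. A model quantized with AWQ still supports a forward pass
    #    (assumes the model returns a plain tensor).
    quantized = quantize_fn(model)
    with torch.no_grad():
        out = quantized(calib_batch)
    assert torch.isfinite(out).all(), "forward pass produced non-finite values"
    return out
```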
These changes should make the AWQ quantization process more robust, memory-efficient, and accurate.