Hello, thank you for your outstanding work!

We couldn't locate any theoretical insights or perspectives related to LUT quantization in the referenced paper, and while reading through the source code we also couldn't find specific details on how the LUT_Biases matrix is generated. Could you please provide relevant literature references or suggest keywords for further research?
Additionally, I believe there may be an issue in the method used to generate the LUT_Scales matrix. Two orderings are possible:
(1) taking the absolute value of each element first and then summing;
(2) summing first and then taking the absolute value.
The current code follows approach (2), but approach (1) seems slightly more appropriate, since summing first lets values of opposite sign cancel and can underestimate the scale (see the sketch below).
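To illustrate the difference, here is a minimal NumPy sketch (illustrative values only, not T-MAC code) of the two orderings for a single group of g activation values:

```python
# Illustrative sketch only (not T-MAC code): contrast the two orderings for
# one group of g activation values; the numbers are made up.
import numpy as np

g = 4                                   # activations folded into one LUT entry
maxv = 127                              # int8 range used to normalize the scale
group = np.array([0.9, -0.8, 0.7, -0.6], dtype=np.float32)

scale_abs_then_sum = np.abs(group).sum() / maxv   # (1) bounds the largest possible |LUT entry|
scale_sum_then_abs = abs(group.sum()) / maxv      # (2) opposite signs cancel before the abs

print(scale_abs_then_sum, scale_sum_then_abs)     # (1) >= (2); here (2) is about 15x smaller
```

If, as I understand it, the LUT entries are ±-signed combinations of these g values, their worst-case magnitude is the sum of absolute values, which is why approach (1) seems to be the safer bound when quantizing the LUT to int8.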
Source code (located at ./python/t_mac/ops/qgemm.py):
```python
# Placeholder computation of the per-group LUT scale.
# `sk` is a reduce axis over the self.g-element sub-groups inside one activation group.
LUT_Scales = te.compute(
    (N, K // self.act_group_size),
    lambda n, kk: te.max(
        te.abs(sum(B[n, kk * self.act_group_size + sk * self.g + g] for g in range(self.g))) / self.maxv,
        axis=sk,
    ),
    name="LUT_Scales",
)
LUT_Biases = te.placeholder((N, K // self.act_group_size), dtype=self.out_dtype, name="LUT_Biases")

if self.has_lut_scale:
    LUT_Scales = te.placeholder((N, K // self.act_group_size), dtype=self.out_dtype, name="LUT_Scales")
    LUT_Biases = te.placeholder((N, K // self.act_group_size), dtype=self.out_dtype, name="LUT_Biases")

    # Applies the per-group activation scale and bias to a LUT value.
    def _lut_scale(n, k, val):
        return val * LUT_Scales[n, k * self.g // self.act_group_size] \
            + LUT_Biases[n, k * self.g // self.act_group_size] * alphas[0]

Scales = te.placeholder(scales_shape, dtype=self.out_dtype, name="Scales")

if self.m_groups == -1:
    if K % self.group_size != 0:
        raise TVMError("K({}) must be divisible by group_size({})".format(K, self.group_size))
    if self.zero_point:
        scales_shape = (M // bm, K // self.group_size, bm // self.bits * 2)

        def _get_scale(m, k):
            # Fake _get_scale, should be tensorized
            return Scales[m // bm, k * self.g // self.group_size, (m % bm) // self.bits * 2] \
                - Scales[m // bm, k * self.g // self.group_size, (m % bm) // self.bits * 2 + 1]
```
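From the _lut_scale helper above, the per-group scale and bias appear to be applied as follows when a LUT value is mapped back to the output dtype (a plain-Python sketch with hypothetical names, not the tensorized kernel):

```python
# Plain-Python sketch of what _lut_scale above seems to compute for one value;
# names are hypothetical and the real kernel is tensorized.
def dequant_lut_value(val, scale, bias, alpha):
    # val   : (quantized) LUT entry for this activation group
    # scale : LUT_Scales[n, group_index]  -- per-group activation scale
    # bias  : LUT_Biases[n, group_index]  -- per-group activation offset
    # alpha : weight-side factor (alphas[0] in the snippet)
    return val * scale + bias * alpha
```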
Thank you for your assistance!
We have compared the accuracy of INT8 LUT quantization with llama.cpp's Q8_0 group-wise activation quantization, both kernel-wise and model-wise, in Sec. 5.6 of our paper.
The LUT quantization in the Python code is just a placeholder computation; please refer to lut_ctor.cc for the actual implementation.
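For context, the Q8_0 baseline mentioned above quantizes activations group-wise with one scale per block; a rough sketch of that scheme (assuming the usual block size of 32 and symmetric int8 scaling; not llama.cpp source):

```python
# Rough sketch of Q8_0-style group-wise activation quantization for comparison
# (assumed block size 32, symmetric int8 scaling; not llama.cpp source).
import numpy as np

def quantize_q8_0(x, block_size=32):
    x = x.reshape(-1, block_size)                       # x length must be divisible by block_size
    scales = np.abs(x).max(axis=1, keepdims=True) / 127.0
    safe = np.where(scales == 0, 1.0, scales)           # avoid division by zero for all-zero blocks
    q = np.clip(np.round(x / safe), -127, 127).astype(np.int8)
    return q, scales
```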