
Request for Documentation on LUT Quantization Theory and Generation Methods for LUT_Biases and LUT_Scales #67

Open
zhouexellent opened this issue Oct 29, 2024 · 1 comment
Labels
question Further information is requested

Comments

@zhouexellent

Hello, thank you for your outstanding work!

We couldn’t locate any theoretical insights or perspectives related to LUT quantization in the referenced paper, and while reading through the source code, we also couldn’t find specific details on the generation method for the LUT_Biases matrix. Could you please provide relevant literature references or suggest keywords for further research?

Additionally, I believe there may be an issue in the method used to generate the LUT_Scales matrix. There are two candidate formulas:

(1) Taking the absolute value of each element first and then summing.
(2) Summing first and then taking the absolute value (what the current code below does).

Approach (1) seems slightly more appropriate than approach (2): it upper-bounds the magnitude of every signed combination of activations that can appear in the LUT, whereas (2) only accounts for the plain sum, so other LUT entries can overflow after scaling. The sketch after this list illustrates the difference.
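A minimal NumPy comparison (illustrative only; `g` and `maxv` stand in for `self.g` and `self.maxv` from the code below):

```python
import numpy as np

# One activation group of size g; maxv plays the role of the int8 bound.
g, maxv = 4, 127
b = np.random.randn(g)

# Approach (1): sum of absolute values. This upper-bounds the magnitude of
# every signed combination +/-b[0] +/-b[1] ... +/-b[g-1] that a LUT entry
# can take, so no entry overflows after dividing by this scale.
scale_1 = np.abs(b).sum() / maxv

# Approach (2): absolute value of the plain sum -- what the te.compute
# below does. It only covers the all-plus combination, so other LUT
# entries can exceed maxv after scaling and get clipped.
scale_2 = np.abs(b.sum()) / maxv

assert scale_1 >= scale_2  # (1) never underestimates the dynamic range
```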

Source code (located at ./python/t_mac/ops/qgemm.py):
```python
LUT_Scales = te.compute(
    (N, K // self.act_group_size),
    lambda n, kk: te.max(
        # NOTE: sums first, then takes the absolute value (approach (2)).
        te.abs(sum(B[n, kk * self.act_group_size + sk * self.g + g] for g in range(self.g))) / self.maxv,
        axis=sk,  # sk: reduce axis over sub-groups, defined elsewhere in qgemm.py
    ),
    name="LUT_Scales",
)

LUT_Biases = te.placeholder((N, K // self.act_group_size), dtype=self.out_dtype, name="LUT_Biases")

if self.has_lut_scale:
    LUT_Scales = te.placeholder((N, K // self.act_group_size), dtype=self.out_dtype, name="LUT_Scales")
    LUT_Biases = te.placeholder((N, K // self.act_group_size), dtype=self.out_dtype, name="LUT_Biases")

    def _lut_scale(n, k, val):
        return (val * LUT_Scales[n, k * self.g // self.act_group_size]
                + LUT_Biases[n, k * self.g // self.act_group_size] * alphas[0])

Scales = te.placeholder(scales_shape, dtype=self.out_dtype, name="Scales")

if self.m_groups == -1:
    if K % self.group_size != 0:
        raise TVMError("K({}) must be divisible by group_size({})".format(K, self.group_size))
    if self.zero_point:
        scales_shape = (M // bm, K // self.group_size, bm // self.bits * 2)

        def _get_scale(m, k):
            # Fake _get_scale, should be tensorized
            return (Scales[m // bm, k * self.g // self.group_size, (m % bm) // self.bits * 2]
                    - Scales[m // bm, k * self.g // self.group_size, (m % bm) // self.bits * 2 + 1])
```

Thank you for your assistance!

@kaleid-liner added the question label Oct 29, 2024
@kaleid-liner
Collaborator

  • We compared the accuracy of INT8 LUT quantization against llama.cpp Q8_0 group-wise activation quantization, at both the kernel level and the model level, in Sec. 5.6 of our paper.
  • The LUT quantization in the Python code is just a placeholder computation; the actual implementation lives in lut_ctor.cc. A rough illustrative sketch of the idea follows below.
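For readers looking for the underlying scheme, here is a hypothetical Python sketch of group-wise INT8 LUT quantization. This is an assumption about how such a table is typically quantized, not the lut_ctor.cc implementation; `build_lut`, `weight_states`, `g`, and `maxv` are illustrative names.

```python
import itertools
import numpy as np

def build_lut(b, weight_states=(-1, 1)):
    """Table of partial sums over one activation group for every
    weight-state combination (2**g entries for 1-bit weights).
    Illustrative only -- the real table construction is in lut_ctor.cc."""
    return np.array([np.dot(w, b)
                     for w in itertools.product(weight_states, repeat=len(b))])

def quantize_lut_int8(lut, maxv=127):
    """Quantize one LUT group to int8 with a per-group scale and bias.
    The bias centers the table so that asymmetric weight states
    (e.g. {0, 1}) still use the full signed int8 range."""
    bias = (lut.max() + lut.min()) / 2        # plays the role of LUT_Biases
    scale = np.abs(lut - bias).max() / maxv   # plays the role of LUT_Scales
    q = np.round((lut - bias) / scale).astype(np.int8)
    return q, scale, bias

b = np.random.randn(4)                        # one activation group, g = 4
q, scale, bias = quantize_lut_int8(build_lut(b))
# Dequantization recovers the table up to rounding error:
assert np.allclose(q * scale + bias, build_lut(b), atol=scale / 2 + 1e-9)
```

Note that for symmetric weight states {-1, 1} the largest-magnitude LUT entry equals the sum of the absolute activations, which is exactly the bound that approach (1) in the question computes.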
