-
Notifications
You must be signed in to change notification settings - Fork 149
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Quantized_model + float LoRA model scenario to model builder #1043
base: main
Are you sure you want to change the base?
Conversation
@@ -437,7 +441,7 @@ def save_model(self, out_dir): | |||
# Quantize ONNX model to desired precision | |||
# TODO: Replace by quantizing the MatMuls as they are created | |||
already_quantized_in_qdq_format = self.quant_type is not None and self.quant_attrs["use_qdq"] # Skip quantizing `MatMul` in `DequantizeLinear --> Transpose --> MatMul` path | |||
if self.onnx_dtype == "int4" and not already_quantized_in_qdq_format: | |||
if self.onnx_dtype == "int4" and not already_quantized_in_qdq_format and not self.matmul_attrs["use_lora"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MatMul4bits quantizer has an option to excludes nodes for quantization https://github.com/microsoft/onnxruntime/blob/e7987a6b0ba429c0bec248c4a471e1782da4be6c/onnxruntime/python/tools/quantization/matmul_4bits_quantizer.py#L1342
Maybe instead of a flag, you can keep a set of the lora matmul names and provide it to the quantizer? Otherwise, if the user provides float base + float adapters with int4
as precision, the output model will be fully float. But you might want to quantize the base model?
also for a quantized base model + float adapters, you might want to quantize the lm head like #940? Not sure what effect always quantizing the lm head has on accuracy though.
@@ -334,23 +369,51 @@ def __init__(self, quant_type, input_path, bits, group_size, q_size, kv_size, in | |||
# model.layers.layer_id.mlp.dense_h_to_4h.bias | |||
module.mlp.gate_proj.bias = tensor[: intermediate_size] | |||
module.mlp.down_proj.bias = tensor[intermediate_size: ] | |||
elif bool(re.match(r"^model.layers\.\d+\.self_attn.q_proj.lora_A\.weight$", name)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these only cover llama type models. phi3 has qkv_proj and gate_up_proj instead of q,k,v,gate_proj,up_proj
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, Kunal also mentioned about that. We can support for that as well
Add Quantized_model + float LoRA model scenario to model builder