Does llmcompressor support hybrid sparsity? #1037
Comments
Hi @jiangjiadi,

Thank you for raising this question and for your interest in llm-compressor! The short answer is yes: llm-compressor does support a hybrid compression approach, enabling you to apply 2:4 sparsity to certain layers while using standard quantization for others. The key to achieving this lies in the configuration of your recipe. Below is an example of how such a recipe might look:

pruning_stage:
  obcq_modifiers:
    SparseGPTModifier:
      sparsity: 0.5
      sequential_update: true
      mask_structure: "2:4"
      targets: ['re:model.layers.1.*$']  # Applies 2:4 sparsity to the first layer
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      ignore: ["lm_head"]  # Excludes specific layers from quantization
      scheme: "FP8_DYNAMIC"
      targets: ["Linear"]  # Applies quantization to all linear layers
  pruning_modifiers:
    ConstantPruningModifier:
      targets: [
        're:.*q_proj.weight',
        're:.*k_proj.weight',
        're:.*v_proj.weight',
        're:.*o_proj.weight',
        're:.*gate_proj.weight',
        're:.*up_proj.weight',
        're:.*down_proj.weight',
      ]
      start: 0  # Ensures sparsity is retained during quantization

Key Notes:
Feel free to experiment with these configurations or let us know if you encounter any issues while implementing this setup. We're here to help!
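Before running a recipe like this, it can help to check which module names a `re:` target actually selects. The snippet below is a minimal sketch, not llm-compressor's own matching code: it assumes that `re:` targets are regular expressions matched from the start of the module name, and the module names and the `matches_target` helper are hypothetical. Under that assumption, the pattern `model.layers.1.*$` selects layer index 1 (the second decoder block when layers are numbered from 0) and also layers 10, 11, and so on, because nothing stops the `1` from being followed by another digit.

```python
import re

# Hypothetical module names for a Llama-style model (illustrative only).
module_names = [
    "model.layers.0",
    "model.layers.1",
    "model.layers.1.mlp.down_proj",
    "model.layers.2",
    "model.layers.10",
    "model.layers.10.mlp.down_proj",
]


def matches_target(target: str, name: str) -> bool:
    """Assumed rule: 're:' targets are regexes matched from the start of the name."""
    if target.startswith("re:"):
        return re.match(target[3:], name) is not None
    return target == name


# Pattern from the recipe above, and a stricter variant that escapes the dots
# and requires the layer index to be exactly 1.
loose = "re:model.layers.1.*$"
strict = r"re:model\.layers\.1(\..*)?$"

loose_hits = [n for n in module_names if matches_target(loose, n)]
strict_hits = [n for n in module_names if matches_target(strict, n)]

print("loose :", loose_hits)
print("strict:", strict_hits)
```

If only one decoder block should be pruned, the stricter pattern is the safer choice; whether a different matching rule applies in a given llm-compressor version should be confirmed against its documentation.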
Hi @rahul-tuli, I got the following error when using the recipe above.
@rahul-tuli After modifying the recipe as follows, I did indeed obtain a model:
However, I am not sure whether the model is actually quantized in a mixed manner. I inspected the compressed tensors and found no difference in format between the tensors in the MLP layers (e.g. model.layers.0.mlp.down_proj.weight) and those in the attention layers (e.g. model.layers.0.self_attn.k_proj.weight). I also tested the inference speed of this hybrid quantized model against the FP8 model, and there is no significant difference between them.
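Independently of the storage format, one way to see whether a given layer was pruned is to test the weight values themselves for the 2:4 pattern, i.e. at most two nonzero entries in every contiguous group of four. The following is a minimal sketch under stated assumptions: the weight is available as a dense nested list (for example via `tensor.tolist()` on a decompressed weight), the groups of four run along the last axis, and the function names are hypothetical helpers rather than part of llm-compressor or vLLM.

```python
from typing import Sequence


def row_is_2_4_sparse(row: Sequence[float]) -> bool:
    """True if every contiguous group of 4 values has at most 2 nonzeros."""
    if len(row) % 4 != 0:
        # The 2:4 pattern is only defined for lengths divisible by 4.
        return False
    return all(
        sum(1 for v in row[i:i + 4] if v != 0) <= 2
        for i in range(0, len(row), 4)
    )


def matrix_is_2_4_sparse(weight: Sequence[Sequence[float]]) -> bool:
    """True if every row of a dense 2-D weight follows the 2:4 pattern."""
    return all(row_is_2_4_sparse(row) for row in weight)


# Toy weights: the first follows 2:4, the second has 3 nonzeros in one group.
pruned = [[0.0, 1.5, 0.0, -2.0, 3.0, 0.0, 0.0, 0.0]]
dense = [[1.0, 1.0, 1.0, 0.0]]

print(matrix_is_2_4_sparse(pruned))
print(matrix_is_2_4_sparse(dense))
```

Running such a check on one layer that was targeted for pruning and one that was not would show whether the sparsity mask survived quantization, even if both layers are stored in the same compressed format.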
Hi @rahul-tuli, I delved into vLLM's handling logic for the partially 2:4 sparse quantized model and discovered that vLLM indeed processes the 2:4 sparse layers as regular quantized layers. This PR vllm-project/vllm#11889 addresses this issue.
Hi @rahul-tuli, I attempted to replace the quantization method with int4, using the recipe below. However, when creating the quantized model, I encountered an error. Recipe:
Is your feature request related to a problem? Please describe.
I've found that the model's performance is constrained when 2:4 sparsity is applied to all linear layers. However, the performance improves significantly when only some layers are subjected to 2:4 sparsity.
Describe the solution you'd like
Does llmcompressor support this hybrid compression format? Specifically, compressing the linear layers that satisfy 2:4 sparsity in the 2:4 format, and compressing the other linear layers that do not meet the 2:4 sparsity criteria using the standard quantization format.