English | 简体中文
📖 Documentation | 🤗 Hugging Face | 🤖 ModelScope | 💬 WeChat | 🫨 Discord
- [25/11/05] We have released v0.2, adding quantization support for new models such as GLM-4.6, Qwen3-VL, and Qwen3-Omni, open-sourcing the Eagle3 speculative decoding training framework, and updating the Diffusion model quantization tools.
- [25/09/30] We have released SpecExit, the reasoning early-exit algorithm: [Paper] | [Docs] | [vLLM Code]🔥🔥🔥
- [25/09/26] We have released TEQUILA, the ternary quantization algorithm [Paper] | [Code]🔥🔥🔥
- [25/09/24] We now support NVFP4 PTQ quantization for the Qwen3 series models. We have also open-sourced Qwen3-32B-NVFP4 and Qwen3-235B-A22B-NVFP4 weights.
Previous News
- [25/09/01] We now support FP8 quantization of the Hunyuan-MT-7B translation model, Torch inference and benchmark evaluation for Eagle3, quantization and cache support for FLUX, and quantization for Seed-OSS.
- [25/08/06] We now support quantization for Hunyuan 0.5B/1.8B/4B/7B and the multimodal model Qwen2.5VL 3B/7B/32B/72B, including FP8/INT4 algorithms, as well as quantization for DeepSeek-R1/V3 and Kimi-K2, including FP8-Static and W4A8-FP8 algorithms. We have also open-sourced Eagle3 model weights for the Hunyuan 1.8B/4B/7B series.
- [25/07/04] We now support quantization for Hunyuan/Qwen2.5/Qwen3/DeepSeek-R1-Distill-Qwen and other models, including INT8/FP8/INT4 algorithms. We have also open-sourced Eagle3 model weights for the Qwen3 series.
- Highly Integrated: This toolkit integrates mainstream compression algorithms into a unified framework, offering developers one-click access with exceptional ease of use.
- Continuous Innovation: Beyond integrating widely-used industry algorithms, we are continuously researching better compression algorithms, which will be gradually open-sourced in the future.
- Performance-Driven: We continuously optimize end-to-end performance in model compression workflows and algorithm deployment, such as enabling quantization of models like Qwen3-235B and DeepSeek-R1 on a single GPU.
AngelSlim covers the following scenarios: Large Language Models (LLMs), Vision Language Models (VLMs), Diffusion Models, and Speech Models (TTS/ASR). For each scenario, the supported compression strategies fall into Quantization, Speculative Decoding, and other techniques; refer to the documentation for the full per-model support matrix.
We recommend using pip to install the latest stable version of AngelSlim:
```shell
pip install angelslim
```

Alternatively, you can clone the repository and install from source:

```shell
cd AngelSlim && python setup.py install
```

For more detailed installation instructions, please refer to the Installation Documentation.
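A quick way to confirm the installation succeeded is to import the package (a minimal sanity check, not an official verification step):

```shell
python -c "import angelslim; print('AngelSlim imported successfully')"
```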
After installing AngelSlim, you can quickly start Eagle3 training with the following scripts:
```shell
# Start the vLLM server
bash scripts/speculative/run_vllm_server.sh

# Generate training data
bash scripts/speculative/generate_data_for_target_model.sh

# Perform online training for the Eagle3 model
bash scripts/speculative/train_eagle3_online.sh
```

For detailed training configurations and vLLM performance benchmarks of Eagle3, please refer to the Quick Start Guide for Speculative Sampling.
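After training, the resulting Eagle3 draft model can be served alongside the target model with vLLM's speculative decoding. The sketch below is a hedged illustration, not the project's documented deployment command: the paths are placeholders, and the exact `--speculative-config` schema depends on your vLLM version.

```shell
# Placeholder paths; flag schema follows recent vLLM releases and may differ by version.
vllm serve Qwen/Qwen3-8B \
  --speculative-config '{"method": "eagle3", "model": "/path/to/eagle3-draft", "num_speculative_tokens": 2}'
```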
After installing AngelSlim, you can launch static FP8 quantization for the Qwen3-1.7B model with the following one-command script:
```shell
python3 tools/run.py -c configs/qwen3/fp8_static/qwen3-1_7b_fp8_static.yaml
```

This example produces quantized model weights by performing PTQ calibration on a model loaded from Hugging Face.
Code-based Start
To perform dynamic FP8 quantization on Qwen3-1.7B:
```python
from angelslim.engine import Engine

slim_engine = Engine()
# Prepare model
slim_engine.prepare_model(model_name="Qwen", model_path="Qwen/Qwen3-1.7B")
# Initialize compressor
slim_engine.prepare_compressor("PTQ", default_method="fp8_dynamic")
# Compress model
slim_engine.run()
# Save compressed model
slim_engine.save("./output")
```

For more details, please refer to the Quick Start Documentation.
Use the `scripts/diffusion/run_diffusion.py` script for quantization and inference:
```shell
# Online quantization and inference
python scripts/diffusion/run_diffusion.py \
    --model-name-or-path black-forest-labs/FLUX.1-schnell \
    --quant-type fp8-per-tensor \
    --prompt "A cat holding a sign that says hello world" \
    --height 1024 --width 1024 --steps 4 --guidance 0.0 --seed 0
```

For more quantization inference methods, please refer to the Diffusion Model Quantization Documentation.
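For reference, the flags above correspond roughly to the following plain `diffusers` generation call. This is a minimal, unquantized baseline sketch that does not use AngelSlim; the dtype and device choices are assumptions.

```python
import torch
from diffusers import FluxPipeline

# Unquantized baseline matching the CLI flags above (4 steps, guidance 0.0, fixed seed).
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
).to("cuda")

image = pipe(
    "A cat holding a sign that says hello world",
    height=1024,
    width=1024,
    num_inference_steps=4,
    guidance_scale=0.0,
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]
image.save("flux_baseline.png")
```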
To test offline inference with a quantized model loaded via transformers, run the following command:
```shell
python scripts/deploy/offline.py $MODEL_PATH "Hello, my name is"
```

Where `MODEL_PATH` is the path to the quantized model output.
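Conceptually, this is equivalent to the following `transformers` sketch. It is an illustrative approximation rather than the script's actual implementation, and the generation parameters are assumptions.

```python
import sys
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path, prompt = sys.argv[1], sys.argv[2]

# Load the quantized checkpoint directly with transformers.
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```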
After specifying the quantized model path `MODEL_PATH`, you can deploy an OpenAI-compatible API service using the following LLM inference frameworks:
- vLLM

  Use the following script to launch a vLLM server; recommended version `vllm>=0.8.5.post1`. For MoE INT8 quantized models, `vllm>=0.9.0` is required.

  ```shell
  bash scripts/deploy/run_vllm.sh --model-path $MODEL_PATH --port 8080 -d 0,1,2,3 -t 4 -p 1 -g 0.8 --max-model-len 4096
  ```

  Where `-d` is the visible devices, `-t` is the tensor parallel size, `-p` is the pipeline parallel size, and `-g` is the GPU memory utilization.

- SGLang

  Use the following script to launch an SGLang server; recommended version `sglang>=0.4.6.post1`.

  ```shell
  bash scripts/deploy/run_sglang.sh --model-path $MODEL_PATH --port 8080 -d 0,1,2,3 -t 4 -g 0.8
  ```
Invoke requests via OpenAI's API format:

```shell
bash scripts/deploy/openai.sh -m $MODEL_PATH -p "Hello, my name is" --port 8080 --max-tokens 4096 --temperature 0.7 --top-p 0.8 --top-k 20 --repetition-penalty 1.05 --system-prompt "You are a helpful assistant."
```

Where `-p` is the input prompt.
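Because the server exposes an OpenAI-compatible endpoint, you can also call it with the standard `openai` Python client. This is a minimal sketch: the model name is a placeholder (use the name the server reports), the port follows the launch command above, and the API key is a dummy value that local servers typically ignore.

```python
from openai import OpenAI

# Point the client at the locally deployed OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="your-quantized-model-name",  # placeholder model name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, my name is"},
    ],
    max_tokens=4096,
    temperature=0.7,
    top_p=0.8,
)
print(response.choices[0].message.content)
```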
Evaluate the performance of the quantized model using lm-evaluation-harness; recommended version `lm-eval>=0.4.8`.
Run script details
```shell
bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --tasks ceval-valid,mmlu,gsm8k,humaneval -n 0 $MODEL_PATH
```

Where `RESULT_PATH` is the directory for saving test results, `-b` is the batch size, `--tasks` specifies the evaluation tasks, and `-n` is the number of few-shot examples.
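If you prefer to call the harness directly instead of the wrapper script, an equivalent invocation looks roughly like this. It is a sketch assuming lm-evaluation-harness's vLLM backend; the argument names follow lm-eval 0.4.x, and the parallelism values mirror the script flags above.

```shell
lm_eval --model vllm \
  --model_args pretrained=$MODEL_PATH,tensor_parallel_size=2,gpu_memory_utilization=0.8 \
  --tasks ceval-valid,mmlu,gsm8k,humaneval \
  --num_fewshot 0 \
  --batch_size auto \
  --output_path $RESULT_PATH
```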
For more details, please refer to the Deployment Documentation.
We evaluated the Eagle3 models trained with AngelSlim on tasks including code generation, mathematical reasoning, instruction following, text generation, and multimodal understanding using vLLM. The inference acceleration and accept-length performance of our trained models under num_speculative_tokens = 2 or 4 are presented below, with accept lengths of 1.8–3.5 and maximum speedups of 1.4–1.9× depending on the model (a worked speedup example follows the first table).
Benchmark results for Qwen3 series models using Eagle3 speculative decoding on vLLM (v0.11.2) across MT-bench, HumanEval, GSM8K and Alpaca, using a single NVIDIA H20 GPU (tp=1, ep=1, num_speculative_tokens=2, batch_size=1, output_len=1024).
| Model | Method | GSM8K | Alpaca | HumanEval | MT-bench | Mean | |||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| throughput (tokens/s) | accept length | throughput (tokens/s) | accept length | throughput (tokens/s) | accept length | throughput (tokens/s) | accept length | throughput (tokens/s) | accept length | ||
| Qwen3-1.7B | Vanilla | 376.42 | 1 | 378.86 | 1 | 378.38 | 1 | 390.53 | 1 | 381.05 | 1 |
| Eagle3 | 616.9 | 2.13 | 653.29 | 2.19 | 680.1 | 2.2 | 621.44 | 2.17 | 642.93 | 2.17 | |
| Qwen3-4B | Vanilla | 229.05 | 1 | 235.29 | 1 | 234.66 | 1 | 234.04 | 1 | 233.26 | 1 |
| Eagle3 | 389.35 | 2.07 | 395.97 | 2.1 | 377.84 | 2.08 | 384.6 | 2.07 | 386.94 | 2.08 | |
| Qwen3-8B | Vanilla | 149.63 | 1 | 149.93 | 1 | 153.85 | 1 | 153.81 | 1 | 151.81 | 1 |
| Eagle3 | 257.32 | 2 | 266.69 | 2.02 | 244.89 | 1.97 | 258.2 | 1.97 | 257.52 | 1.99 | |
| Qwen3-14B | Vanilla | 92.97 | 1 | 92.66 | 1 | 92.94 | 1 | 94.46 | 1 | 93.26 | 1 |
| Eagle3 | 153.72 | 1.87 | 140.46 | 1.78 | 144.68 | 1.76 | 142.45 | 1.74 | 145.33 | 1.79 | |
| Qwen3-32B | Vanilla | 43.49 | 1 | 43.38 | 1 | 43.19 | 1 | 43.3 | 1 | 43.32 | 1 |
| Eagle3 | 80.43 | 2.01 | 72.49 | 1.9 | 71.57 | 1.86 | 74.1 | 1.86 | 74.1 | 1.91 | |
| Qwen3-30B-A3B | Vanilla | 311.84 | 1 | 320.43 | 1 | 325.77 | 1 | 325.42 | 1 | 320.87 | 1 |
| Eagle3 | 453.97 | 2.1 | 432.45 | 2.04 | 428.81 | 2.02 | 437.06 | 2.01 | 438.07 | 2.04 | |
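As an illustration of how the speedup figures are derived, the speedup is simply the ratio of Eagle3 throughput to vanilla throughput; the sketch below uses the Qwen3-1.7B GSM8K column from the table above.

```python
# Speedup = Eagle3 throughput / vanilla throughput (Qwen3-1.7B, GSM8K column above).
vanilla_tps = 376.42
eagle3_tps = 616.90
print(f"speedup: {eagle3_tps / vanilla_tps:.2f}x")  # ~1.64x
```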
Benchmark results for Qwen3-VL series models using Eagle3 speculative decoding on vLLM (v0.12.0) across language and multimodal tasks, using a single NVIDIA H20 GPU (tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024).
| Model | Method | GSM8K | Alpaca | HumanEval | MT-bench | MATH-500 | MMMU | MMStar | |||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| throughput (tokens/s) | accept length | throughput (tokens/s) | accept length | throughput (tokens/s) | accept length | throughput (tokens/s) | accept length | throughput (tokens/s) | accept length | throughput (tokens/s) | accept length | throughput (tokens/s) | accept length | ||
| Qwen3-VL-2B-Instruct | Vanilla | 348.55 | 1 | 350.9 | 1 | 346.07 | 1 | 346.31 | 1 | 82.96 | 1 | 83.27 | 1 | 81.63 | 1 |
| Eagle3 | 511.52 | 2.11 | 560.55 | 2.26 | 826.01 | 3.39 | 555.22 | 2.29 | 163.09 | 2.57 | 154.18 | 2.55 | 139.73 | 2.31 | |
| Qwen3-VL-4B-Instruct | Vanilla | 212.87 | 1 | 213.24 | 1 | 211.69 | 1 | 212.1 | 1 | 67.96 | 1 | 65.88 | 1 | 67.75 | 1 |
| Eagle3 | 415.29 | 2.57 | 372.89 | 2.26 | 459.37 | 2.82 | 382.33 | 2.34 | 141.87 | 2.72 | 104.44 | 2.05 | 107.07 | 2.1 | |
| Qwen3-VL-30B-A3B-Instruct | Vanilla | 179.94 | 1 | 184.6 | 1 | 168.68 | 1 | 180.57 | 1 | 31.08 | 1 | 31.51 | 1 | 30.93 | 1 |
| Eagle3 | 281.93 | 2.82 | 241.42 | 2.13 | 223.05 | 2.57 | 240.47 | 2.19 | 75.31 | 2.79 | 48.47 | 1.78 | 52.57 | 1.94 | |
Benchmark results for HunyuanOCR using Eagle3 speculative decoding on vLLM (v0.13.0) across OCR tasks, using a single NVIDIA H20 GPU (tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024).
| Model | Method | OCR-Bench-Internal | |
|---|---|---|---|
| throughput (tokens/s) | accept length | ||
| Hunyuan-OCR | Vanilla | 71.21 | 1 |
| Eagle3 | 120.75 | 2.2 | |
Benchmark results for Qwen2-Audio using Eagle3 speculative decoding on vLLM (v0.12.0) on the LibriSpeech dataset, using a single NVIDIA H20 GPU (tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024).
| Model | Method | LibriSpeech | |
|---|---|---|---|
| throughput (tokens/s) | accept length | ||
| Qwen2_Audio | Vanilla | 78.76 | 1 |
| Eagle3 | 146.66 | 3.51 | |
Benchmark results for Fun-CosyVoice3 using Eagle3 speculative decoding on the LibriTTS dataset, using a single NVIDIA H20 GPU (tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024).
| Model | Method | LibriTTS | |
|---|---|---|---|
| throughput (tokens/s) | accept length | ||
| Fun-CosyVoice3 | Vanilla | - | 1 |
| Eagle3 | - | 1.96 | |
Fun-CosyVoice3 is adapted for Transformers backend inference, so only the accept length is reported; the vLLM speedup of ~1.6× is estimated from the baseline LLM speedup.
The performance test results for selected models are shown below. For the complete benchmark, refer to the Benchmark documentation.
Benchmark results for the Hunyuan-Instruct series models with FP8, INT4-AWQ, and INT4-GPTQ quantization algorithms on datasets including OlympiadBench, AIME 2024, DROP, and GPQA-Diamond:
| Model | Quantization | OlympiadBench | AIME 2024 | DROP | GPQA-Diamond |
|---|---|---|---|---|---|
| Hunyuan-A13B-Instruct | BF16 | 82.7 | 87.30 | 91.1 | 71.2 |
| FP8-Static | 83.0 | 86.7 | 91.1 | - | |
| Int4-GPTQ | 82.7 | 86.7 | 91.1 | - | |
| Int4-AWQ | 82.6 | 85.6 | 91.0 | - | |
| Hunyuan-7B-Instruct | BF16 | 76.5 | 81.1 | 85.9 | 60.1 |
| FP8-Static | 76.6 | 80.9 | 86.0 | 60.1 | |
| Int4-GPTQ | 76.2 | 81.0 | 85.7 | 60.0 | |
| Int4-AWQ | 76.4 | 80.9 | 85.9 | 60.1 | |
| Hunyuan-4B-Instruct | BF16 | 73.1 | 78.3 | 78.2 | 61.1 |
| FP8-Static | 73.1 | 76.6 | 78.3 | 60.2 | |
| Int4-GPTQ | 72.9 | - | 78.1 | 58.1 | |
| Int4-AWQ | 72.8 | - | 78.2 | - | |
| Hunyuan-1.8B-Instruct | BF16 | 63.4 | 56.7 | 76.7 | 47.2 |
| FP8-Static | 62.5 | 55.2 | 75.1 | 47.7 | |
| Int4-GPTQ | 60.9 | - | 73.0 | 44.4 | |
| Int4-AWQ | 61.7 | - | 71.7 | 43.6 | |
| Hunyuan-0.5B-Instruct | BF16 | 29.6 | 17.2 | 52.8 | 23.3 |
| FP8-Static | 29.6 | 17.2 | 51.6 | 22.5 | |
| Int4-GPTQ | 26.8 | - | 50.9 | 23.3 | |
| Int4-AWQ | 26.3 | - | 48.9 | 23.3 |
Benchmark results for Qwen3 series models with FP8-Static, FP8-Dynamic, INT4-GPTQ, and INT4-AWQ quantization algorithms on datasets including CEVAL, MMLU, GSM8K, and HUMANEVAL:
| Model | Quantization | CEVAL | MMLU | GSM8K | HUMANEVAL |
|---|---|---|---|---|---|
| Qwen3-0.6B | BF16 | 45.84 | 47.21 | 42.99 | 19.51 |
| FP8-Static | 45.99 | 46.87 | 38.06 | 18.90 | |
| FP8-Dynamic | 45.99 | 46.93 | 38.29 | 20.73 | |
| INT8-Dynamic | 45.17 | 46.95 | 41.17 | 21.34 | |
| Qwen3-8B | BF16 | 79.27 | 74.78 | 87.79 | 63.41 |
| FP8-Static | 78.23 | 74.79 | 86.96 | 62.20 | |
| FP8-Dynamic | 78.45 | 74.75 | 87.64 | 62.80 | |
| INT8-Dynamic | 78.01 | 74.84 | 86.96 | 67.07 | |
| INT4-GPTQ | 77.19 | 73.26 | 86.43 | 62.20 | |
| INT4-AWQ | 76.15 | 73.59 | 86.96 | 63.41 | |
| Qwen3-14B | BF16 | 83.06 | 78.90 | 88.40 | 55.49 |
| FP8-Static | 82.62 | 78.57 | 89.46 | 57.32 | |
| FP8-Dynamic | 82.24 | 78.92 | 88.32 | 52.44 | |
| INT8-Dynamic | 81.87 | 78.13 | 86.28 | 56.10 | |
| INT4-GPTQ | 81.05 | 78.02 | 87.34 | 57.93 | |
| INT4-AWQ | 82.02 | 77.68 | 84.23 | 61.59 | |
| Qwen3-32B | BF16 | 86.55 | 82.00 | 74.53 | 37.80 |
| FP8-Static | 86.92 | 81.78 | 70.20 | 39.63 | |
| FP8-Dynamic | 86.55 | 81.89 | 70.43 | 38.41 | |
| INT4-GPTQ | 86.18 | 81.01 | - | 43.29 | |
| INT4-AWQ | 86.18 | 81.54 | - | 36.59 | |
| Qwen3-30B-A3B | BF16 | 83.66 | 79.36 | 89.99 | 31.71 |
| FP8-Static | 83.95 | 79.47 | 89.01 | 31.10 | |
| FP8-Dynamic | 84.10 | 79.40 | 89.16 | 32.93 | |
| INT8-Dynamic | 83.36 | 79.48 | 89.16 | 34.15 | |
| Qwen3-235B-A22B | BF16 | 89.60 | 86.28 | 85.29 | 27.44 |
| FP8-Static | 89.67 | 86.19 | 86.96 | 27.44 | |
| FP8-Dynamic | 89.67 | 86.18 | 85.22 | 28.05 | |
| INT8-Dynamic | 88.93 | 86.20 | 86.20 | 23.78 |
Benchmark results for DeepSeek-R1-0528 series models with FP8-Block-Wise and W4A8-FP8 quantization algorithms on datasets including GPQA Diamond, AIME 2024, SimpleQA, and LiveCodeBench:
| Model | Quantization | GPQA Diamond | AIME 2024 | SimpleQA | LiveCodeBench |
|---|---|---|---|---|---|
| DeepSeek-R1-0528 | FP8-Block-Wise | 78.28 | 88.67 | 27.8 | 77.1 |
| W4A8-FP8 | 77.37 | 88.67 | 26.83 | 78.86 |
Note
- The above results are based on the average of 5 test runs deployed with TRT-LLM
- The hyperparameters used during evaluation are as follows:
{ "top_k": 20, "top_p": 0.6, "temperature": 0.7, "output_seq_len": 32768, "max_input_seq_len": 16384 }
Qwen3-VL Benchmark
Benchmark results for Qwen3-VL series models with BF16, FP8-Static, and FP8-Dynamic quantization algorithms on datasets including MMMU_VAL, DocVQA_VAL, and ChartQA_TEST:
| Model | Quantization | MMMU_VAL | DocVQA_VAL | ChartQA_TEST |
|---|---|---|---|---|
| Qwen3-VL-32B-Instruct | BF16 | 60.11 | 96.08 | 94.64 |
| FP8-Static | 61.22 | 96.00 | 94.64 | |
| FP8-Dynamic | 60.78 | 96.19 | 94.72 | |
| Qwen3-VL-30B-A3B-Instruct | BF16 | 50.44 | 95.28 | 95.36 |
| FP8-Dynamic | 50.67 | 95.25 | 95.20 |
Qwen2.5VL Benchmark
Benchmark results for Qwen2.5VL series models with BF16, FP8-Static, FP8-Dynamic, INT4-GPTQ, and INT4-AWQ quantization algorithms on datasets including MMMU_VAL, DocVQA_VAL, and ChartQA_TEST:
| Model | Quantization | MMMU_VAL | DocVQA_VAL | ChartQA_TEST |
|---|---|---|---|---|
| Qwen2.5VL-3B | BF16 | 47.11 | 78.57 | 80.32 |
| FP8-Static | 47.33 | 79.34 | 79.68 | |
| FP8-Dynamic | 45.99 | 46.93 | 38.29 | |
| INT4-GPTQ | 46.56 | 77.20 | 78.96 | |
| INT4-AWQ | 45.78 | - | 79.60 | |
| Qwen2.5VL-7B | BF16 | 45.44 | 89.71 | 84.64 |
| FP8-Static | 47.00 | 89.83 | 85.92 | |
| FP8-Dynamic | 47.22 | 89.80 | 88.64 | |
| INT4-GPTQ | 46.67 | 90.45 | - | |
| INT4-AWQ | 45.67 | 89.28 | - | |
| Qwen2.5VL-32B | BF16 | 57.00 | 90.03 | - |
| FP8-Static | 57.00 | 89.88 | - | |
| FP8-Dynamic | 56.44 | 89.88 | - | |
| INT4-GPTQ | 55.22 | 89.80 | - | |
| INT4-AWQ | 55.22 | 90.30 | - | |
| Qwen2.5VL-72B | BF16 | 58.78 | 94.39 | 85.60 |
| FP8-Static | 57.89 | 94.41 | 85.84 | |
| FP8-Dynamic | 58.67 | 94.38 | 85.60 | |
| INT4-GPTQ | 57.56 | 94.46 | 86.48 | |
| INT4-AWQ | 58.78 | 94.19 | 87.28 |
Qwen3-Omni Text to Text Benchmark
Benchmark results for Qwen3-Omni series models in BF16, FP8-Static, and FP8-Dynamic on aime25, gpqa_diamond, and mmlu_redux are as follows:
| Model | Quantization | aime25 | gpqa_diamond | mmlu_redux |
|---|---|---|---|---|
| Qwen3-Omni-30B-A3B-Instruct | BF16 | 73.32 | 56.77 | 88.09 |
| FP8-Static | 71.33 | 56.57 | 87.91 | |
| FP8-Dynamic | 73.33 | 55.15 | 88.07 |
Note
- The above evaluation results were obtained by deploying with the vLLM framework and averaging over 5 runs (vLLM only supports the thinker component).
- The hyperparameters used during evaluation are as follows:
{ "top_p": 0.95, "temperature": 0.6, "do_sample": true, "max-model-len 65536": 65536 }
Other models such as GLM-4.6, Qwen2.5, and Seed-OSS have been evaluated on benchmarks like CEVAL, MMLU, and GSM8K using quantization strategies including FP8-Static, FP8-Dynamic, INT4-GPTQ, and INT4-AWQ.
Benchmark Experiment Details
| Model | Quantization | CEVAL | MMLU | GSM8K |
|---|---|---|---|---|
| Qwen2.5-1.5B-Instruct | BF16 | 67.01 | 60.05 | 54.28 |
| FP8-Static | 66.27 | 60.23 | - | |
| FP8-Dynamic | 66.79 | 60.08 | 51.71 | |
| Qwen2.5-7B-Instruct | BF16 | 81.20 | 74.55 | 79.98 |
| FP8-Static | 81.13 | 74.03 | 79.30 | |
| FP8-Dynamic | 80.31 | 74.07 | 79.00 | |
| INT4-GPTQ | 79.05 | 73.05 | 74.75 | |
| INT4-AWQ | 79.35 | 73.22 | 79.38 | |
| Qwen2.5-32B-Instruct | BF16 | 87.30 | 83.21 | 81.73 |
| FP8-Static | 87.59 | 83.08 | 81.58 | |
| FP8-Dynamic | 87.30 | 83.04 | 81.58 | |
| INT4-GPTQ | 86.70 | 82.45 | 82.03 | |
| INT4-AWQ | 87.00 | 82.64 | - | |
| DeepSeek-R1-Distill-Qwen-7B | BF16 | 53.49 | 53.80 | 75.74 |
| FP8-Static | 53.57 | 54.17 | 76.19 | |
| FP8-Dynamic | 52.97 | 54.13 | 74.15 | |
| INT4-GPTQ | 51.86 | 52.44 | 75.89 | |
| INT4-AWQ | 53.49 | 53.70 | - | |
| DeepSeek-R1-Distill-Qwen-14B | BF16 | 77.71 | 74.28 | 85.67 |
| FP8-Static | 77.56 | 74.66 | 86.73 | |
| FP8-Dynamic | 76.82 | 74.63 | 87.11 | |
| INT4-GPTQ | 74.29 | 72.37 | 84.61 | |
| INT4-AWQ | 74.81 | 73.00 | 86.05 | |
| DeepSeek-R1-Distill-Qwen-32B | BF16 | 84.18 | 80.89 | 87.41 |
| FP8-Static | 83.43 | 80.90 | 87.57 | |
| FP8-Dynamic | 83.73 | 81.10 | 86.43 | |
| INT4-GPTQ | 84.10 | 79.80 | 86.73 | |
| INT4-AWQ | 82.84 | 80.15 | 87.19 |
The code for this project is open-sourced under the License for AngelSlim.
```bibtex
@software{AngelSlim2025,
  title={{AngelSlim}},
  author={Tencent AngelSlim Project Contributors},
  year={2025},
  month={6},
  url={https://github.com/Tencent/AngelSlim},
}
```
- AngelSlim is continuously iterating and new features will be released soon. If you have any questions or suggestions, please open an issue on GitHub Issues or join our WeChat discussion group.