Releases · InternLM/lmdeploy

10 Feb 06:00

lvhan028

v0.7.0.post3

e98fd6a

v0.7.0.post3 Latest


What's Changed

💥 Improvements

Set max concurrent requests by @AllentDan in #2961
remove logitswarper by @grimoire in #3109

🐞 Bug fixes

fix user guide about cogvlm deployment by @lvhan028 in #3088
fix positional argument by @lvhan028 in #3086

🌐 Other

[Fix] fix the URL judgment problem in Windows by @Lychee-acaca in #3103
bump version to v0.7.0.post3 by @lvhan028 in #3115

New Contributors

@Lychee-acaca made their first contribution in #3103

Full Changelog: v0.7.0.post2...v0.7.0.post3

Contributors

grimoire, lvhan028, and 2 other contributors


27 Jan 15:57

lvhan028

v0.7.0.post2

637435f

LMDeploy Release V0.7.0.post2

What's Changed

💥 Improvements

Add deepseek-r1 chat template by @AllentDan in #3072
Update tokenizer by @lvhan028 in #3061

🐞 Bug fixes

Add system role to deepseek chat template by @AllentDan in #3031
Fix xcomposer2d5 by @irexyc in #3087

🌐 Other

bump version to v0.7.0.post2 by @lvhan028 in #3094

Full Changelog: v0.7.0.post1...v0.7.0.post2

Contributors

lvhan028, irexyc, and AllentDan


25 Jan 11:35

lvhan028

v0.7.0.post1

552bf3a

LMDeploy Release V0.7.0.post1

What's Changed

💥 Improvements

use weights iterator while loading by @RunningLeon in #2886

🐞 Bug fixes

[dlinfer] fix ascend qwen2_vl graph_mode by @yao-fengchen in #3045
fix error in interactive api by @lvhan028 in #3074
fix sliding window mgr by @grimoire in #3068
More arguments in api_client, update docstrings by @AllentDan in #3077

🌐 Other

[ci] add internlm3 into testcase by @zhulinJulia24 in #3038
add internlm3 to supported models by @lvhan028 in #3041
update pre-commit config by @lvhan028 in #2683
[maca] add cudagraph support on maca backend. by @Reinerzhou in #2834
bump version to v0.7.0.post1 by @lvhan028 in #3076

Full Changelog: v0.7.0...v0.7.0.post1

Contributors

grimoire, lvhan028, and 5 other contributors


15 Jan 10:04

lvhan028

v0.7.0

9fcb3b1

LMDeploy Release v0.7.0

What's Changed

🚀 Features

Support moe w8a8 in pytorch engine by @grimoire in #2894
Support DeepseekV3 fp8 by @grimoire in #2967
support new backend cambricon by @JackWeiw in #3002
support-moe-fp8 by @RunningLeon in #3007
add internlm3-dense(turbomind) & chat template by @irexyc in #3024
support internlm3 on pt by @RunningLeon in #3026
Support internlm3 quantization by @AllentDan in #3027

💥 Improvements

Optimize awq kernel in pytorch engine by @grimoire in #2965
Support fp8 w8a8 for pt backend by @RunningLeon in #2959
Optimize lora kernel by @grimoire in #2975
Remove threadsafe by @grimoire in #2907
Refactor async engine & turbomind IO by @lzhangzz in #2968
[dlinfer]rope refine by @JackWeiw in #2984
Expose spaces_between_special_tokens by @AllentDan in #2991
[dlinfer]change llm op interface of paged_prefill_attention. by @JackWeiw in #2977
Update request logger by @lvhan028 in #2981
remove decoding by @grimoire in #3016

🐞 Bug fixes

Fix build crash in nvcr.io/nvidia/pytorch:24.06-py3 image by @zgjja in #2964
add tool role in BaseChatTemplate as tool response in messages by @AllentDan in #2979
Fix ascend dockerfile by @jinminxi104 in #2989
fix internvl2 qk norm by @grimoire in #2987
fix xcomposer2 when transformers is upgraded greater than 4.46 by @irexyc in #3001
Fix get_ppl & get_logits by @lvhan028 in #3008
Fix typo in w4a16 guide by @Yan-Xiangjun in #3018
fix blocked fp8 moe kernel by @grimoire in #3009
Fix async engine by @lzhangzz in #3029
[hotfix] Fix get_ppl by @lvhan028 in #3023
Fix MoE gating for DeepSeek V2 by @lzhangzz in #3030
Fix empty response for pipeline by @lzhangzz in #3034
Fix potential hang during TP model initialization by @lzhangzz in #3033

🌐 Other

[ci] add w8a8 and internvl2.5 models into testcase by @zhulinJulia24 in #2949
bump version to v0.7.0 by @lvhan028 in #3010

New Contributors

@zgjja made their first contribution in #2964
@Yan-Xiangjun made their first contribution in #3018

Full Changelog: 0.6.5...v0.7.0

Contributors

grimoire, lvhan028, and 9 other contributors


30 Dec 10:15

lvhan028

0.6.5

af0fcf2

LMDeploy Release v0.6.5

What's Changed

🚀 Features

[dlinfer] feat: add DlinferFlashAttention to support qwen vl. by @Reinerzhou in #2952

💥 Improvements

refactor PyTorchEngine check env by @grimoire in #2870
refine multi-backend setup.py by @jinminxi104 in #2880
Refactor VLM modules by @lvhan028 in #2810
[dlinfer] only compile the language model in vl models by @tangzhiyi11 in #2893
Optimize tp broadcast by @grimoire in #2889
unfreeze torch version in dockerfile by @RunningLeon in #2906
support tp > n_kv_heads for pt engine by @RunningLeon in #2872
replicate kv for some models when tp is divisible by kv_head_num by @irexyc in #2874
Fallback to pytorch engine when the model is quantized by smooth quant by @lvhan028 in #2953
Torchrun launching multiple api_server by @AllentDan in #2402

🐞 Bug fixes

[Feature] Support for loading lora adapter weights in safetensors format by @Galaxy-Husky in #2860
fix cpu cache by @grimoire in #2881
Fix args type in docstring by @Galaxy-Husky in #2888
Fix llama3.1 chat template by @fzyzcjy in #2862
Fix typo by @ghntd in #2916
fix: Incorrect stats size during inference of throughput benchmark when concurrency > num_prompts by @pancak3 in #2928
fix lora name and rearrange wqkv for internlm2 by @RunningLeon in #2912
[dlinfer] fix moe op for dlinfer. by @Reinerzhou in #2917
[side effect] fix vlm quant failed by @lvhan028 in #2914
fix torch_dtype by @RunningLeon in #2933
support unaligned qkv heads by @grimoire in #2930
fix mllama inference without image by @RunningLeon in #2947
Support torch_dtype modification and update FAQs for AWQ quantization by @AllentDan in #2898
Fix exception handler for proxy server by @AllentDan in #2901
Fix torch_dtype in lite by @AllentDan in #2956
[side-effect] bring back quantization of qwen2-vl, glm4v and etc. by @lvhan028 in #2954
add a thread pool executor to control the vl engine traffic by @lvhan028 in #2970
[side-effect] fix gradio demo error by @lvhan028 in #2976

🌐 Other

[dlinfer] fix engine checker by @tangzhiyi11 in #2891
Bump version to v0.6.5 by @lvhan028 in #2955

New Contributors

@Galaxy-Husky made their first contribution in #2860
@fzyzcjy made their first contribution in #2862
@ghntd made their first contribution in #2916
@pancak3 made their first contribution in #2928

Full Changelog: v0.6.4...0.6.5

Contributors

grimoire, lvhan028, and 10 other contributors


09 Dec 12:08

lvhan028

v0.6.4

14b64c7

LMDeploy Release v0.6.4

What's Changed

🚀 Features

feature: support qwen2.5 function_call by @akai-shuuichi in #2737
[Feature] support minicpm-v_2_6 for pytorch engine. by @Reinerzhou in #2767
Support qwen2-vl AWQ quantization by @AllentDan in #2787
Add DeepSeek-V2 support by @lzhangzz in #2763
[ascend]feat: support kv int8 by @yao-fengchen in #2736

💥 Improvements

Optimize update_step_ctx on Ascend by @jinminxi104 in #2804
Add Ascend installation adapter by @zhabuye in #2817
Refactor turbomind (2/N) by @lzhangzz in #2818
add openssh-server installation in dockerfile by @lvhan028 in #2830
Add version restrictions in runtime_ascend.txt to ensure functionality by @zhabuye in #2836
better kv allocate by @grimoire in #2814
Update internvl chat template by @AllentDan in #2832
profile throughput without new threads by @grimoire in #2826
[dlinfer] change dlinfer kv_cache layout and adjust paged_prefill_attention api. by @Reinerzhou in #2847
[maca] add env to support different mm layout on maca. by @Reinerzhou in #2835
Supports W8A8 quantization for more models by @AllentDan in #2850

🐞 Bug fixes

disable prefix-caching for vl model by @grimoire in #2825
Fix gemma2 accuracy through the correct softcapping logic by @AllentDan in #2842
fix accessing before initialization by @lvhan028 in #2845
fix the logic to verify whether AutoAWQ has been successfully installed by @grimoire in #2844
check whether backend_config is None or not before accessing its attr by @lvhan028 in #2848
[ascend] convert kv cache to nd format in ascend graph mode by @tangzhiyi11 in #2853

📚 Documentations

Update supported models & Ascend doc by @jinminxi104 in #2765
update supported models by @lvhan028 in #2849

🌐 Other

[CI] Split vl testcases into turbomind and pytorch backend by @zhulinJulia24 in #2751
[dlinfer] Fix qwenvl rope error for dlinfer backend by @JackWeiw in #2795
[CI] add more testcase for mllm models by @zhulinJulia24 in #2791
Update dlinfer-ascend version in runtime_ascend.txt by @jinminxi104 in #2865
bump version to v0.6.4 by @lvhan028 in #2864

New Contributors

@akai-shuuichi made their first contribution in #2737
@JackWeiw made their first contribution in #2795
@zhabuye made their first contribution in #2817

Full Changelog: v0.6.3...v0.6.4

Contributors

grimoire, lvhan028, and 10 other contributors


16 Nov 04:31

lvhan028

v0.6.3

0c80baa

LMDeploy Release V0.6.3

What's Changed

🚀 Features

support yarn in turbomind backend by @irexyc in #2519
add linear op on dlinfer platform by @yao-fengchen in #2627
support turbomind head_dim 64 by @irexyc in #2715
[Feature]: support LlavaForConditionalGeneration with turbomind inference by @deepindeed2022 in #2710
Support Mono-InternVL with PyTorch backend by @wzk1015 in #2727
Support Qwen2-MoE models by @lzhangzz in #2723
Support mixtral moe AWQ quantization. by @AllentDan in #2725
Support chemvlm by @RunningLeon in #2738
Support molmo in turbomind by @lvhan028 in #2716

💥 Improvements

Call cuda empty_cache to prevent OOM when quantizing model by @AllentDan in #2671
feat: support dynamic/llama3 rotary embedding in ascend graph mode by @tangzhiyi11 in #2670
Add ensure_ascii = False for json.dumps by @AllentDan in #2707
Flatten cache and add flashattention by @grimoire in #2676
Support ep, column major moe kernel. by @grimoire in #2690
Remove one of the duplicate bos tokens by @AllentDan in #2708
Check server input by @irexyc in #2719
optimize dlinfer moe by @tangzhiyi11 in #2741

🐞 Bug fixes

Support min_tokens, min_p parameters for api_server by @AllentDan in #2681
fix index error when computing ppl on long-text prompt by @lvhan028 in #2697
Better tp exit log. by @grimoire in #2677
miss to read moe_ffn weights from converted tm model by @lvhan028 in #2698
Fix turbomind TP by @lzhangzz in #2706
fix decoding kernel for deepseekv2 by @grimoire in #2688
fix tp exit code for pytorch engine by @RunningLeon in #2718
fix assert pad >= 0 failed when inter_size is not a multiple of group… by @Vinkle-hzt in #2740
fix issue that mono-internvl failed to fallback pytorch engine by @lvhan028 in #2744
Remove use_fast=True when loading tokenizer for lite auto_awq by @AllentDan in #2758
set wrong head_dim for mistral-nemo by @lvhan028 in #2761

📚 Documentations

Update ascend readme by @jinminxi104 in #2756
fix ascend get_started.md link by @CyCle1024 in #2696
Fix llama3.2 VL vision in "Supported Modals" documents by @blankanswer in #2703

🌐 Other

[ci] support v100 dailytest by @zhulinJulia24 in #2665
[ci] add more testcase into evaluation and daily test by @zhulinJulia24 in #2721
feat: support multi cards in ascend graph mode by @tangzhiyi11 in #2755
bump version to v0.6.3 by @lvhan028 in #2754

New Contributors

@blankanswer made their first contribution in #2703
@tangzhiyi11 made their first contribution in #2670
@wzk1015 made their first contribution in #2727
@Vinkle-hzt made their first contribution in #2740

Full Changelog: v0.6.2...v0.6.3

Contributors

grimoire, lvhan028, and 13 other contributors


07 Nov 07:41

lvhan028

v0.6.2.post1

4fc9479

LMDeploy Release v0.6.2.post1

What's Changed

Bugs

Fix llama3.2 VL vision in "Supported Modals" documents by @blankanswer in #2703
miss to read moe_ffn weights from converted tm model by @lvhan028 in #2698
better tp exit log by @grimoire in #2677
fix index error when computing ppl on long-text prompt by @lvhan028 in #2697
Support min_tokens, min_p parameters for api_server by @AllentDan in #2681
fix ascend get_started.md link by @CyCle1024 in #2696
Call cuda empty_cache to prevent OOM when quantizing model by @AllentDan in #2671
Fix turbomind TP for v0.6.2 by @lzhangzz in #2713

🌐 Other

[ci] support v100 dailytest by @zhulinJulia24 in #2665
bump version to 0.6.2.post1 by @lvhan028 in #2717

Full Changelog: v0.6.2...v0.6.2.post1

Contributors

grimoire, lvhan028, and 4 other contributors


29 Oct 06:42

lvhan028

v0.6.2

522108c

LMDeploy Release v0.6.2

Highlights

PyTorch engine supports graph mode on ascend platform, doubling the inference speed
Support llama3.2-vision models in PyTorch engine
Support Mixtral in TurboMind engine, achieving 20+ RPS on the ShareGPT dataset with 2 A100-80G GPUs

What's Changed

🚀 Features

support downloading models from openmind_hub by @cookieyyds in #2563
Support pytorch engine kv int4/int8 quantization by @AllentDan in #2438
feat(ascend): support w4a16 by @yao-fengchen in #2587
[maca] add maca backend support. by @Reinerzhou in #2636
Support mllama for pytorch engine by @AllentDan in #2605
add --eager-mode to cli by @RunningLeon in #2645
[ascend] add ascend graph mode by @CyCle1024 in #2647
MoE support for turbomind by @lzhangzz in #2621

💥 Improvements

[Feature] Add argument to disable FastAPI docs by @mouweng in #2540
add check for device with cap 7.x by @grimoire in #2535
Add tool role for langchain usage by @AllentDan in #2558
Fix llama3.2-1b inference error by handling tie_word_embedding by @grimoire in #2568
Add a workaround for saving internvl2 with latest transformers by @AllentDan in #2583
optimize paged attention on triton3 by @grimoire in #2553
refactor for multi backends in dlinfer by @CyCle1024 in #2619
Copy sglang/bench_serving.py to lmdeploy as serving benchmark script by @lvhan028 in #2620
Add barrier to prevent TP nccl kernel waiting. by @grimoire in #2607
[ascend] refactor fused_moe on ascend platform by @yao-fengchen in #2613
[ascend] support paged_prefill_attn when batch > 1 by @yao-fengchen in #2612
Raise an error for the wrong chat template by @AllentDan in #2618
refine pre-post-process by @jinminxi104 in #2632
small block_m for sm7.x by @grimoire in #2626
update check for triton by @grimoire in #2641
Support llama3.2 LLM models in turbomind engine by @lvhan028 in #2596
Check whether device support bfloat16 by @lvhan028 in #2653
Add warning message about do_sample to alert BC by @lvhan028 in #2654
update ascend dockerfile by @CyCle1024 in #2661
fix supported model list in ascend graph mode by @jinminxi104 in #2669
remove dlinfer version by @CyCle1024 in #2672

🐞 Bug fixes

set outlines<0.1.0 by @AllentDan in #2559
fix: make exit_flag verification for ascend more general by @CyCle1024 in #2588
set capture mode thread_local by @grimoire in #2560
Add distributed context in pytorch engine to support torchrun by @grimoire in #2615
Fix error in python3.8. by @Reinerzhou in #2646
Align UT with triton fill_kv_cache_quant kernel by @AllentDan in #2644
miss device_type when checking is_bf16_supported on ascend platform by @lvhan028 in #2663
fix syntax in Dockerfile_aarch64_ascend by @CyCle1024 in #2664
Set history_cross_kv_seqlens to 0 by default by @AllentDan in #2666
fix build error in ascend dockerfile by @CyCle1024 in #2667
bugfix: llava-hf/llava-interleave-qwen-7b-hf (#2497) by @deepindeed2022 in #2657
fix inference mode error for qwen2-vl by @irexyc in #2668

📚 Documentations

Add instruction for downloading models from openmind hub by @cookieyyds in #2577
Fix spacing in ascend user guide by @Superskyyy in #2601
Update get_started tutorial about deploying on ascend platform by @jinminxi104 in #2655
Update ascend get_started tutorial about installing nnal by @jinminxi104 in #2662

🌐 Other

[ci] add oc infer test in stable test by @zhulinJulia24 in #2523
update copyright by @lvhan028 in #2579
[Doc]: Lock sphinx version by @RunningLeon in #2594
[ci] use local requirements for test workflow by @zhulinJulia24 in #2569
[ci] add pytorch kvint testcase into function regression by @zhulinJulia24 in #2584
[ci] React dailytest workflow by @zhulinJulia24 in #2617
[ci] fix restful script by @zhulinJulia24 in #2635
[ci] add internlm2_5_7b_batch_1 into evaluation testcase by @zhulinJulia24 in #2631
match torch and torch_vision version by @grimoire in #2649
Bump version to v0.6.2 by @lvhan028 in #2659

New Contributors

@mouweng made their first contribution in #2540
@cookieyyds made their first contribution in #2563
@Superskyyy made their first contribution in #2601
@Reinerzhou made their first contribution in #2636
@deepindeed2022 made their first contribution in #2657

Full Changelog: v0.6.1...v0.6.2

Contributors

grimoire, lvhan028, and 13 other contributors


28 Sep 11:34

lvhan028

v0.6.1

2e49fc3

LMDeploy Release V0.6.1

What's Changed

🚀 Features

Support user-specified data type by @lvhan028 in #2473
Support minicpm3-4b by @AllentDan in #2465
support Qwen2-VL with pytorch backend by @irexyc in #2449

💥 Improvements

Add silu mul kernel by @grimoire in #2469
adjust schedule to improve TTFT in pytorch engine by @grimoire in #2477
Add max_log_len option to control length of printed log by @lvhan028 in #2478
set served model name being repo_id from hub before it is downloaded by @lvhan028 in #2494
Improve proxy server usage by @AllentDan in #2488
CudaGraph mixin by @grimoire in #2485
pytorch engine add get_logits by @grimoire in #2487
Refactor lora by @grimoire in #2466
support noaligned silu_and_mul by @grimoire in #2506
optimize performance of ascend backend's update_step_context() by calculating kv_start_indices in a new way by @jiajie-yang in #2521
Fix chatglm tokenizer failed when transformers>=4.45.0 by @AllentDan in #2520

🐞 Bug fixes

Fix "TypeError: Got unsupported ScalarType BFloat16" by @SeitaroShinagawa in #2472
fix ascend atten_mask by @yao-fengchen in #2483
Catch exceptions thrown by turbomind inference thread by @lvhan028 in #2502
The get_ppl missed the last token of each iteration during multi-iter prefill by @lvhan028 in #2499
fix vl gradio by @irexyc in #2527

🌐 Other

[ci] regular update by @zhulinJulia24 in #2431
[CI] add base model evaluation by @zhulinJulia24 in #2490
bump version to v0.6.1 by @lvhan028 in #2513

New Contributors

@SeitaroShinagawa made their first contribution in #2472

Full Changelog: v0.6.0...v0.6.1

Contributors

grimoire, lvhan028, and 6 other contributors

