Releases: huggingface/text-generation-inference
v3.3.2
Gaudi improvements.
What's Changed
- upgrade to new vllm extension ops (fix issue in exponential bucketing) by @sywangyi in #3239
- Nix: switch to hf-nix by @danieldk in #3240
- Add Qwen3 by @yuanwu2017 in #3229
- fp8 compressed_tensors w8a8 support by @sywangyi in #3242
- [Gaudi] Fix the OOM issue of Llama-4-Scout-17B-16E-Instruct by @yuanwu2017 in #3245
- Fix the Llama-4-Maverick-17B-128E crash issue by @yuanwu2017 in #3246
- Prepare for 3.3.2 by @danieldk in #3249
Full Changelog: v3.3.1...v3.3.2
v3.3.1
This release updates TGI to Torch 2.7 and CUDA 12.8.
What's Changed
- change HPU warmup logic: seq length should be with exponential growth by @kaixuanliu in #3217
- adjust the `round_up_seq` logic to align with prefill warmup phase on… by @kaixuanliu in #3224
- Update to Torch 2.7.0 by @danieldk in #3221
- Enable Llama4 for gaudi backend by @yuanwu2017 in #3223
- fix: count gpu uuids if NVIDIA_VISIBLE_DEVICES env set to all by @drbh in #3230
- Deepseek r1 by @sywangyi in #3211
- Refine warmup and upgrade to synapse AI 1.21.0 by @sywangyi in #3234
- fix the crash in default ATTENTION path by @sywangyi in #3235
- Switch to punica-sgmv kernel from the Hub by @danieldk in #3236
- move input_ids to hpu and remove disposal of adapter_meta by @sywangyi in #3237
- Prepare for 3.3.1 by @danieldk in #3238
New Contributors
- @kaixuanliu made their first contribution in #3217
Full Changelog: v3.3.0...v3.3.1
v3.3.0
Notable changes
- Prefill chunking for VLMs.
What's Changed
- Fixing Qwen 2.5 VL (32B). by @Narsil in #3157
- Fixing tokenization like https://github.com/huggingface/text-embeddin… by @Narsil in #3156
- Gaudi: clean cuda/rocm code in hpu backend, enable flat_hpu by @sywangyi in #3113
- L4 fixes by @mht-sharma in #3161
- setuptools <= 70.0 is vulnerable: CVE-2024-6345 by @Narsil in #3171
- transformers flash llm/vlm enabling in ipex by @sywangyi in #3152
- Upgrading the dependencies in Gaudi backend. by @Narsil in #3170
- Hotfixing gaudi deps. by @Narsil in #3174
- Hotfix gaudi2 with newer transformers. by @Narsil in #3176
- Support flashinfer for Gemma3 prefill by @danieldk in #3167
- Get opentelemetry trace id from request headers instead of creating a new trace by @kozistr in #2648
- Bump `sccache` to 0.10.0 by @alvarobartt in #3179
- Fixing CI by @Narsil in #3184
- Add option to configure prometheus port by @mht-sharma in #3187
- Warmup gaudi backend by @sywangyi in #3172
- Put more wiggle room. by @Narsil in #3189
- Fixing the router + template for Qwen3. by @Narsil in #3200
- Skip `{% generation %}` and `{% endgeneration %}` template handling by @alvarobartt in #3204
- doc typo by @julien-c in #3206
- Pr 2982 ci branch by @drbh in #3046
- fix: bump snaps for mllama by @drbh in #3202
- Update client SDK snippets by @julien-c in #3207
- Fix `HF_HUB_OFFLINE=1` for Gaudi backend by @regisss in #3193
- IPEX support FP8 kvcache/softcap/slidingwindow by @sywangyi in #3144
- forward and tokenize chooser use the same shape by @sywangyi in #3196
- Chunked Prefill VLM by @mht-sharma in #3188
- Prepare for 3.3.0 by @danieldk in #3220
New Contributors
Full Changelog: v3.2.3...v3.3.0
v3.2.3
Main changes
- Patching Llama 4
What's Changed
- Use ROCM 6.3.1 by @mht-sharma in #3141
- Update transformers to 4.51 by @mht-sharma in #3148
- Gaudi: Add Integration Test for Gaudi Backend by @baptistecolle in #3142
- fix: compute type typo by @oOraph in #3150
- 3.2.3 by @Narsil in #3151
Full Changelog: v3.2.2...v3.2.3
v3.2.2
What's Changed
- Minor fixes. by @Narsil in #3125
- configurable termination timeout by @ErikKaum in #3126
- CI: enable server tests for backends by @baptistecolle in #3128
- Torch 2.6 by @Narsil in #3134
- Gaudi: Fix llava-next and mllama crash issue by @yuanwu2017 in #3127
- nix-v3.2.1 -> v3.2.1-nix by @co42 in #3129
- Gaudi: Use exponential growth to replace BATCH_BUCKET_SIZE by @yuanwu2017 in #3131
- Add llama4 by @mht-sharma in #3145
- Preparing for release. by @Narsil in #3147
New Contributors
Full Changelog: v3.2.1...v3.2.2
v3.2.1
What's Changed
- Update to `kernels` 0.2.1 by @danieldk in #3084
- Router: add `gemma3-text` model type by @danieldk in #3107
- We need gcc during runtime to enable triton to compile kernels. by @Narsil in #3103
- Release of Gaudi Backend for TGI by @baptistecolle in #3091
- Fixing the docker build. by @Narsil in #3108
- Make the Nix-based Docker container work on non-NixOS by @danieldk in #3109
- xpu 2.6 update by @sywangyi in #3051
- launcher: correctly get the head dimension for VLMs by @danieldk in #3116
- Gaudi: Sync TGI with the latest changes from the TGI-Gaudi fork by @baptistecolle in #3117
- Bug Fix: Sliding Window Attention by @mht-sharma in #3112
- Publish nix docker image. by @Narsil in #3122
- Prepare for patch release. by @Narsil in #3124
- Intel docker. by @Narsil in #3121
Full Changelog: v3.2.0...v3.2.1
v3.2.0
Important changes
- BREAKING CHANGE: Lots of modifications around tool calling. Tool calling now fully follows the OpenAI return format (the `arguments` field is returned as a string instead of a real JSON object). Many improvements around tool calling, and several side effects fixed (a client-side sketch follows this list).
- Added Gemma 3 support.
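Client code that previously read `arguments` as a parsed object now has to decode the string explicitly. A minimal client-side sketch, assuming a TGI server on `localhost:8080` exposing its OpenAI-compatible `/v1/chat/completions` route; the tool definition and model alias are illustrative, not part of this release:

```python
import json
from openai import OpenAI

# TGI exposes an OpenAI-compatible endpoint; the api_key is ignored locally.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="-")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="tgi",  # a single-model TGI server does not route on this name
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
)

call = response.choices[0].message.tool_calls[0]
# Since v3.2.0, `arguments` is a JSON-encoded string (as in the OpenAI spec),
# so it must be parsed before use instead of being treated as a dict.
arguments = json.loads(call.function.arguments)
print(call.function.name, arguments)
```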
What's Changed
- fix(neuron): explicitly install toolchain by @dacorvo in #3072
- Only add token when it is defined. by @Narsil in #3073
- Making sure Olmo (transformers backend) works. by @Narsil in #3074
- Making `tool_calls` a vector. by @Narsil in #3075
- Nix: add `openai` to impure shell for integration tests by @danieldk in #3081
- Update `--max-batch-total-tokens` description by @alvarobartt in #3083
- Fix tool call2 by @Narsil in #3076
- Nix: the launcher needs a Python env with Torch for GPU detection by @danieldk in #3085
- Add request parameters to OTel span for `/v1/chat/completions` endpoint by @aW3st in #3000
- Add qwen2 multi lora layers support by @EachSheep in #3089
- Add modules_to_not_convert in quantized model by @jiqing-feng in #3053
- Small test and typing fixes by @danieldk in #3078
- hotfix: qwen2 formatting by @danieldk in #3093
- Pr 3003 ci branch by @drbh in #3007
- Update the llamacpp backend by @angt in #3022
- Fix qwen vl by @Narsil in #3096
- Update README.md by @celsowm in #3095
- Fix tool call3 by @Narsil in #3086
- Add gemma3 model by @mht-sharma in #3099
- Fix tool call4 by @Narsil in #3094
- Update neuron backend by @dacorvo in #3098
- Preparing release 3.2.0 by @Narsil in #3100
- Try to fix on main CI color. by @Narsil in #3101
New Contributors
- @EachSheep made their first contribution in #3089
- @jiqing-feng made their first contribution in #3053
Full Changelog: v3.1.1...v3.2.0
v3.1.1
What's Changed
- Back on nix main. by @Narsil in #2979
- hotfix: fix trtllm CI build on release by @Hugoch in #2981
- Add `strftime_now` callable function for `minijinja` chat templates by @alvarobartt in #2983
- impureWithCuda: fix gcc version by @danieldk in #2990
- Improve qwen vl impl by @drbh in #2943
- Using the "lockfile". by @Narsil in #2992
- Triton fix by @sywangyi in #2995
- [Backend] Bump TRTLLM to v.0.17.0 by @mfuntowicz in #2991
- Updating mllama after strftime. by @Narsil in #2993
- Use kernels from the kernel hub by @danieldk in #2988
- fix Qwen VL break in intel platform by @sywangyi in #3002
- Update the flaky mllama test. by @Narsil in #3015
- Preventing single user hugging the server to death by asking by @Narsil in #3016
- Putting back the NCCL forced upgrade. by @Narsil in #2999
- Support sigmoid scoring function in GPTQ-MoE by @danieldk in #3017
- [Backend] Add Llamacpp backend by @angt in #2975
- Use eetq kernel from the hub by @danieldk in #3029
- Update README.md by @celsowm in #3024
- Add `loop_controls` feature to `minijinja` to handle `{% break %}` by @alvarobartt in #2998
- Pinning trufflehog. by @Narsil in #3032
- It's find in some machine. using hf_hub::api::sync::Api to download c… by @Narsil in #3030
- Improve Transformers support by @Cyrilvallez in #2970
- feat: add initial qwen2.5-vl model and test by @drbh in #2971
- Using public external registry (to use external runners for CI). by @Narsil in #3031
- Having less logs in case of failure for checking CI more easily. by @Narsil in #3037
- feat: Add the parsing of HF_HUB_USER_AGENT_ORIGIN environment variable for telemetry by @Hugoch in #3027
- update ipex and torch to 2.6 for cpu by @sywangyi in #3039
- flashinfer 0.2.0.post1 -> post2 by @danieldk in #3040
- fix qwen2 vl crash in continous batching by @sywangyi in #3004
- Simplify logs2. by @Narsil in #3045
- Update Gradio ChatInterface configuration in consuming_tgi.md by @angt in #3042
- Improve tool call message processing by @drbh in #3036
- Use `rotary` kernel from the Hub by @danieldk in #3041
- Add Neuron backend by @dacorvo in #3033
- You need to seek apparently. by @Narsil in #3049
- some minor fix by @sywangyi in #3048
- fix: run linters and fix formatting by @drbh in #3057
- Avoid running neuron integration tests twice by @dacorvo in #3054
- Add Gaudi Backend by @baptistecolle in #3055
- Fix two edge cases in `RadixTrie::find` by @danieldk in #3067
- Add property-based testing for `RadixAllocator` by @danieldk in #3068
- feat: add support for HF_HUB_USER_AGENT_ORIGIN to add user-agent Origin field in Hub requests. by @Hugoch in #3061
- Preparing for release. by @Narsil in #3060
- Fix a tiny typo in `monitoring.md` tutorial by @sadra-barikbin in #3056
- Patch rust release. by @Narsil in #3069
New Contributors
- @angt made their first contribution in #2975
- @celsowm made their first contribution in #3024
- @dacorvo made their first contribution in #3033
- @sadra-barikbin made their first contribution in #3056
Full Changelog: v3.1.0...v3.1.1
v3.1.0
Important changes
Deepseek R1 is fully supported on both AMD and Nvidia!
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
ghcr.io/huggingface/text-generation-inference:3.1.0 --model-id deepseek-ai/DeepSeek-R1
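Once the container is up, the server can be queried through the standard TGI client API; a minimal sketch, assuming the port mapping above and a recent `huggingface_hub` release (prompt and generation parameters are illustrative):

```python
from huggingface_hub import InferenceClient

# Point the client at the local TGI instance started with the command above.
client = InferenceClient(base_url="http://localhost:8080")

response = client.chat_completion(
    messages=[{"role": "user", "content": "Explain the Chinese Remainder Theorem briefly."}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```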
What's Changed
- Attempt to remove AWS S3 flaky cache for sccache by @mfuntowicz in #2953
- Update to attention-kernels 0.2.0 by @danieldk in #2950
- fix: Telemetry by @Hugoch in #2957
- Fixing the oom maybe with 2.5.1 change. by @Narsil in #2958
- Add backend name to telemetry by @Hugoch in #2962
- Add fp8 support moe models by @mht-sharma in #2928
- Update to moe-kernels 0.8.0 by @danieldk in #2966
- Hotfixing intel-cpu (not sure how it was working before). by @Narsil in #2967
- Add deepseekv3 by @Narsil in #2968
- doc: Update TRTLLM deployment doc. by @Hugoch in #2960
- Update moe-kernel to 0.8.2 for rocm by @mht-sharma in #2977
- Prepare for release 3.1.0 by @Narsil in #2972
Full Changelog: v3.0.2...v3.1.0
v3.0.2
Tl;dr
New transformers backend supporting flash attention at roughly the same performance as pure TGI, covering all models that are not officially supported directly in TGI. Congrats @Cyrilvallez
New models unlocked: Cohere2, olmo, olmo2, helium.
What's Changed
- docs(README): supported hardware links TGI AMD GPUs by @guspan-tanadi in #2814
- Fixing latest flavor by disabling it. by @Narsil in #2831
- fix facebook/opt-125m not working issue by @sywangyi in #2824
- Fixup opt to reduce the amount of odd if statements. by @Narsil in #2833
- TensorRT-LLM backend bump to latest version + misc fixes by @mfuntowicz in #2791
- Feat/trtllm cancellation dev container by @Hugoch in #2795
- New arg. by @Narsil in #2845
- Fixing CI. by @Narsil in #2846
- fix: lint backend and doc files by @drbh in #2850
- Qwen2-VL runtime error fix when prompted with multiple images by @janne-alatalo in #2840
- Update vllm kernels for ROCM by @mht-sharma in #2826
- change xpu lib download link by @sywangyi in #2852
- fix: include add_special_tokens in kserve request by @drbh in #2859
- chore: fixed some typos and attribute issues in README by @ruidazeng in #2891
- update ipex xpu to fix issue in ARC770 by @sywangyi in #2884
- Basic flashinfer 0.2 support by @danieldk in #2862
- Improve vlm support (add idefics3 support) by @drbh in #2437
- Update to marlin-kernels 0.3.7 by @danieldk in #2882
- chore: Update jsonschema to 0.28.0 by @Stranger6667 in #2870
- Add possible variants for A100 and H100 GPUs for auto-detecting flops by @lazariv in #2837
- Update using_guidance.md by @nbroad1881 in #2901
- fix crash in torch2.6 if TP=1 by @sywangyi in #2885
- Add Flash decoding kernel ROCm by @mht-sharma in #2855
- Enable FP8 Per-Tensor Scales and Integrate Marlin/MoE Kernels Repo for ROCm by @mht-sharma in #2825
- Baichuan2-13B does not have max_position_embeddings in config by @sywangyi in #2903
- docs(conceptual/speculation): available links Train Medusa by @guspan-tanadi in #2863
- Fix `docker run` in `README.md` by @alvarobartt in #2861
- Add guide on using TPU with TGI in the docs by @baptistecolle in #2907
- Upgrading our rustc version. by @Narsil in #2908
- Fix typo in TPU docs by @baptistecolle in #2911
- Removing the github runner. by @Narsil in #2912
- Upgrading bitsandbytes. by @Narsil in #2910
- Do not convert weight scale to e4m3fnuz on CUDA by @danieldk in #2917
- feat: improve star coder to support multi lora layers by @drbh in #2883
- Flash decoding kernel adding and prefill-chunking and prefix caching enabling in intel cpu/xpu by @sywangyi in #2815
- nix: update to PyTorch 2.5.1 by @danieldk in #2921
- Moving to `uv` instead of `poetry`. by @Narsil in #2919
- Add fp8 kv cache for ROCm by @mht-sharma in #2856
- fix the crash of meta-llama/Llama-3.2-1B by @sywangyi in #2918
- feat: improve qwen2-vl startup by @drbh in #2802
- Revert "feat: improve qwen2-vl startup " by @drbh in #2924
- flashinfer: switch to plan API by @danieldk in #2904
- Fixing TRTLLM dockerfile. by @Narsil in #2922
- Flash Transformers modeling backend support by @Cyrilvallez in #2913
- Give TensorRT-LLM a proper CI/CD by @mfuntowicz in #2886
- Trying to avoid the random timeout. by @Narsil in #2929
- Run `pre-commit run --all-files` to fix CI by @alvarobartt in #2933
- Upgrading the deps to have transformers==4.48.0 necessary by @Narsil in #2937
- fix moe in quantization path by @sywangyi in #2935
- Clarify FP8-Marlin use on capability 8.9 by @danieldk in #2940
- Bump TensorRT-LLM backend dependency to v0.16.0 by @mfuntowicz in #2931
- Set `alias` for `max_completion_tokens` in `ChatRequest` by @alvarobartt in #2932
- Add NVIDIA A40 to known cards by @kldzj in #2941
- [TRTLLM] Expose finish reason by @mfuntowicz in #2841
- Tmp tp transformers by @Narsil in #2942
- Transformers backend TP fix by @Cyrilvallez in #2945
- Trying to put back the archlist (to fix the oom). by @Narsil in #2947
New Contributors
- @janne-alatalo made their first contribution in #2840
- @ruidazeng made their first contribution in #2891
- @Stranger6667 made their first contribution in #2870
- @lazariv made their first contribution in #2837
- @baptistecolle made their first contribution in #2907
- @Cyrilvallez made their first contribution in #2913
- @kldzj made their first contribution in #2941
Full Changelog: v3.0.1...v3.0.2