Releases: huggingface/text-generation-inference
v3.3.2
Gaudi improvements.
What's Changed
- upgrade to new vllm extension ops (fix issue in exponential bucketing) by @sywangyi in #3239
- Nix: switch to hf-nix by @danieldk in #3240
- Add Qwen3 by @yuanwu2017 in #3229
- fp8 compressed_tensors w8a8 support by @sywangyi in #3242
- [Gaudi] Fix the OOM issue of Llama-4-Scout-17B-16E-Instruct by @yuanwu2017 in #3245
- Fix the Llama-4-Maverick-17B-128E crash issue by @yuanwu2017 in #3246
- Prepare for 3.3.2 by @danieldk in #3249
Full Changelog: v3.3.1...v3.3.2
v3.3.1
This release updates TGI to Torch 2.7 and CUDA 12.8.
What's Changed
- change HPU warmup logic: seq length should be with exponential growth by @kaixuanliu in #3217
- adjust the `round_up_seq` logic to align with prefill warmup phase on… by @kaixuanliu in #3224
- Update to Torch 2.7.0 by @danieldk in #3221
- Enable Llama4 for gaudi backend by @yuanwu2017 in #3223
- fix: count gpu uuids if NVIDIA_VISIBLE_DEVICES env set to all by @drbh in #3230
- Deepseek r1 by @sywangyi in #3211
- Refine warmup and upgrade to synapse AI 1.21.0 by @sywangyi in #3234
- fix the crash in default ATTENTION path by @sywangyi in #3235
- Switch to punica-sgmv kernel from the Hub by @danieldk in #3236
- move input_ids to hpu and remove disposal of adapter_meta by @sywangyi in #3237
- Prepare for 3.3.1 by @danieldk in #3238
New Contributors
- @kaixuanliu made their first contribution in #3217
Full Changelog: v3.3.0...v3.3.1
v3.3.0
Notable changes
- Prefill chunking for VLMs.
What's Changed
- Fixing Qwen 2.5 VL (32B). by @Narsil in #3157
- Fixing tokenization like https://github.com/huggingface/text-embeddin… by @Narsil in #3156
- Gaudi: clean cuda/rocm code in hpu backend, enable flat_hpu by @sywangyi in #3113
- L4 fixes by @mht-sharma in #3161
- setuptools <= 70.0 is vulnerable: CVE-2024-6345 by @Narsil in #3171
- transformers flash llm/vlm enabling in ipex by @sywangyi in #3152
- Upgrading the dependencies in Gaudi backend. by @Narsil in #3170
- Hotfixing gaudi deps. by @Narsil in #3174
- Hotfix gaudi2 with newer transformers. by @Narsil in #3176
- Support flashinfer for Gemma3 prefill by @danieldk in #3167
- Get opentelemetry trace id from request headers instead of creating a new trace by @kozistr in #2648
- Bump `sccache` to 0.10.0 by @alvarobartt in #3179
- Fixing CI by @Narsil in #3184
- Add option to configure prometheus port by @mht-sharma in #3187
- Warmup gaudi backend by @sywangyi in #3172
- Put more wiggle room. by @Narsil in #3189
- Fixing the router + template for Qwen3. by @Narsil in #3200
- Skip `{% generation %}` and `{% endgeneration %}` template handling by @alvarobartt in #3204
- doc typo by @julien-c in #3206
- Pr 2982 ci branch by @drbh in #3046
- fix: bump snaps for mllama by @drbh in #3202
- Update client SDK snippets by @julien-c in #3207
- Fix `HF_HUB_OFFLINE=1` for Gaudi backend by @regisss in #3193
- IPEX support FP8 kvcache/softcap/slidingwindow by @sywangyi in #3144
- forward and tokenize chooser use the same shape by @sywangyi in #3196
- Chunked Prefill VLM by @mht-sharma in #3188
- Prepare for 3.3.0 by @danieldk in #3220
New Contributors
Full Changelog: v3.2.3...v3.3.0
v3.2.3
Main changes
- Patching Llama 4
What's Changed
- Use ROCM 6.3.1 by @mht-sharma in #3141
- Update transformers to 4.51 by @mht-sharma in #3148
- Gaudi: Add Integration Test for Gaudi Backend by @baptistecolle in #3142
- fix: compute type typo by @oOraph in #3150
- 3.2.3 by @Narsil in #3151
Full Changelog: v3.2.2...v3.2.3
v3.2.2
What's Changed
- Minor fixes. by @Narsil in #3125
- configurable termination timeout by @ErikKaum in #3126
- CI: enable server tests for backends by @baptistecolle in #3128
- Torch 2.6 by @Narsil in #3134
- Gaudi: Fix llava-next and mllama crash issue by @yuanwu2017 in #3127
- nix-v3.2.1 -> v3.2.1-nix by @co42 in #3129
- Gaudi: Use exponential growth to replace BATCH_BUCKET_SIZE by @yuanwu2017 in #3131
- Add llama4 by @mht-sharma in #3145
- Preparing for release. by @Narsil in #3147
New Contributors
Full Changelog: v3.2.1...v3.2.2
v3.2.1
What's Changed
- Update to `kernels` 0.2.1 by @danieldk in #3084
- Router: add `gemma3-text` model type by @danieldk in #3107
- We need gcc during runtime to enable triton to compile kernels. by @Narsil in #3103
- Release of Gaudi Backend for TGI by @baptistecolle in #3091
- Fixing the docker build. by @Narsil in #3108
- Make the Nix-based Docker container work on non-NixOS by @danieldk in #3109
- xpu 2.6 update by @sywangyi in #3051
- launcher: correctly get the head dimension for VLMs by @danieldk in #3116
- Gaudi: Sync TGI with the latest changes from the TGI-Gaudi fork by @baptistecolle in #3117
- Bug Fix: Sliding Window Attention by @mht-sharma in #3112
- Publish nix docker image. by @Narsil in #3122
- Prepare for patch release. by @Narsil in #3124
- Intel docker. by @Narsil in #3121
Full Changelog: v3.2.0...v3.2.1
v3.2.0
Important changes
- BREAKING CHANGE: Lots of modifications around tool calling. Tool calling now fully follows the OpenAI return format (the `arguments` field is returned as a string instead of a real JSON object). Many improvements around tool calling, and several side effects fixed (a client-side sketch follows this list).
- Added Gemma 3 support.
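Client code that previously read `arguments` as a parsed object now has to decode the string explicitly. A minimal client-side sketch, assuming a TGI server on `localhost:8080` exposing its OpenAI-compatible `/v1/chat/completions` route; the tool definition and model alias are illustrative, not part of this release:

```python
import json
from openai import OpenAI

# TGI exposes an OpenAI-compatible endpoint; the api_key is ignored locally.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="-")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="tgi",  # a single-model TGI server does not route on this name
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
)

call = response.choices[0].message.tool_calls[0]
# Since v3.2.0, `arguments` is a JSON-encoded string (as in the OpenAI spec),
# so it must be parsed before use instead of being treated as a dict.
arguments = json.loads(call.function.arguments)
print(call.function.name, arguments)
```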
What's Changed
- fix(neuron): explicitly install toolchain by @dacorvo in #3072
- Only add token when it is defined. by @Narsil in #3073
- Making sure Olmo (transformers backend) works. by @Narsil in #3074
- Making `tool_calls` a vector. by @Narsil in #3075
- Nix: add `openai` to impure shell for integration tests by @danieldk in #3081
- Update `--max-batch-total-tokens` description by @alvarobartt in #3083
- Fix tool call2 by @Narsil in #3076
- Nix: the launcher needs a Python env with Torch for GPU detection by @danieldk in #3085
- Add request parameters to OTel span for `/v1/chat/completions` endpoint by @aW3st in #3000
- Add qwen2 multi lora layers support by @EachSheep in #3089
- Add modules_to_not_convert in quantized model by @jiqing-feng in #3053
- Small test and typing fixes by @danieldk in #3078
- hotfix: qwen2 formatting by @danieldk in #3093
- Pr 3003 ci branch by @drbh in #3007
- Update the llamacpp backend by @angt in #3022
- Fix qwen vl by @Narsil in #3096
- Update README.md by @celsowm in #3095
- Fix tool call3 by @Narsil in #3086
- Add gemma3 model by @mht-sharma in #3099
- Fix tool call4 by @Narsil in #3094
- Update neuron backend by @dacorvo in #3098
- Preparing release 3.2.0 by @Narsil in #3100
- Try to fix on main CI color. by @Narsil in #3101
New Contributors
- @EachSheep made their first contribution in #3089
- @jiqing-feng made their first contribution in #3053
Full Changelog: v3.1.1...v3.2.0
v3.1.1
What's Changed
- Back on nix main. by @Narsil in #2979
- hotfix: fix trtllm CI build on release by @Hugoch in #2981
- Add `strftime_now` callable function for `minijinja` chat templates by @alvarobartt in #2983
- impureWithCuda: fix gcc version by @danieldk in #2990
- Improve qwen vl impl by @drbh in #2943
- Using the "lockfile". by @Narsil in #2992
- Triton fix by @sywangyi in #2995
- [Backend] Bump TRTLLM to v.0.17.0 by @mfuntowicz in #2991
- Updating mllama after strftime. by @Narsil in #2993
- Use kernels from the kernel hub by @danieldk in #2988
- fix Qwen VL break in intel platform by @sywangyi in #3002
- Update the flaky mllama test. by @Narsil in #3015
- Preventing single user hugging the server to death by asking by @Narsil in #3016
- Putting back the NCCL forced upgrade. by @Narsil in #2999
- Support sigmoid scoring function in GPTQ-MoE by @danieldk in #3017
- [Backend] Add Llamacpp backend by @angt in #2975
- Use eetq kernel from the hub by @danieldk in #3029
- Update README.md by @celsowm in #3024
- Add `loop_controls` feature to `minijinja` to handle `{% break %}` by @alvarobartt in #2998
- Pinning trufflehog. by @Narsil in #3032
- It's find in some machine. using hf_hub::api::sync::Api to download c… by @Narsil in #3030
- Improve Transformers support by @Cyrilvallez in #2970
- feat: add initial qwen2.5-vl model and test by @drbh in #2971
- Using public external registry (to use external runners for CI). by @Narsil in #3031
- Having less logs in case of failure for checking CI more easily. by @Narsil in #3037
- feat: Add the parsing of HF_HUB_USER_AGENT_ORIGIN environment variable for telemetry by @Hugoch in #3027
- update ipex and torch to 2.6 for cpu by @sywangyi in #3039
- flashinfer 0.2.0.post1 -> post2 by @danieldk in #3040
- fix qwen2 vl crash in continous batching by @sywangyi in #3004
- Simplify logs2. by @Narsil in #3045
- Update Gradio ChatInterface configuration in consuming_tgi.md by @angt in #3042
- Improve tool call message processing by @drbh in #3036
- Use `rotary` kernel from the Hub by @danieldk in #3041
- Add Neuron backend by @dacorvo in #3033
- You need to seek apparently. by @Narsil in #3049
- some minor fix by @sywangyi in #3048
- fix: run linters and fix formatting by @drbh in #3057
- Avoid running neuron integration tests twice by @dacorvo in #3054
- Add Gaudi Backend by @baptistecolle in #3055
- Fix two edge cases in `RadixTrie::find` by @danieldk in #3067
- Add property-based testing for `RadixAllocator` by @danieldk in #3068
- feat: add support for HF_HUB_USER_AGENT_ORIGIN to add user-agent Origin field in Hub requests. by @Hugoch in #3061
- Preparing for release. by @Narsil in #3060
- Fix a tiny typo in `monitoring.md` tutorial by @sadra-barikbin in #3056
- Patch rust release. by @Narsil in #3069
New Contributors
- @angt made their first contribution in #2975
- @celsowm made their first contribution in #3024
- @dacorvo made their first contribution in #3033
- @sadra-barikbin made their first contribution in #3056
Full Changelog: v3.1.0...v3.1.1
v3.1.0
Important changes
Deepseek R1 is fully supported on both AMD and Nvidia!
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
ghcr.io/huggingface/text-generation-inference:3.1.0 --model-id deepseek-ai/DeepSeek-R1
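Once the container is up, the server can be queried through the standard TGI client API; a minimal sketch, assuming the port mapping above and a recent `huggingface_hub` release (prompt and generation parameters are illustrative):

```python
from huggingface_hub import InferenceClient

# Point the client at the local TGI instance started with the command above.
client = InferenceClient(base_url="http://localhost:8080")

response = client.chat_completion(
    messages=[{"role": "user", "content": "Explain the Chinese Remainder Theorem briefly."}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```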
What's Changed
- Attempt to remove AWS S3 flaky cache for sccache by @mfuntowicz in #2953
- Update to attention-kernels 0.2.0 by @danieldk in #2950
- fix: Telemetry by @Hugoch in #2957
- Fixing the oom maybe with 2.5.1 change. by @Narsil in #2958
- Add backend name to telemetry by @Hugoch in #2962
- Add fp8 support moe models by @mht-sharma in #2928
- Update to moe-kernels 0.8.0 by @danieldk in #2966
- Hotfixing intel-cpu (not sure how it was working before). by @Narsil in #2967
- Add deepseekv3 by @Narsil in #2968
- doc: Update TRTLLM deployment doc. by @Hugoch in #2960
- Update moe-kernel to 0.8.2 for rocm by @mht-sharma in #2977
- Prepare for release 3.1.0 by @Narsil in #2972
Full Changelog: v3.0.2...v3.1.0
v3.0.2
Tl;dr
New transformers backend supporting flash attention at roughly the same performance as pure TGI, covering all models that are not officially supported directly in TGI. Congrats @Cyrilvallez
New models unlocked: Cohere2, olmo, olmo2, helium.
What's Changed
- docs(README): supported hardware links TGI AMD GPUs by @guspan-tanadi in #2814
- Fixing latest flavor by disabling it. by @Narsil in #2831
- fix facebook/opt-125m not working issue by @sywangyi in #2824
- Fixup opt to reduce the amount of odd if statements. by @Narsil in #2833
- TensorRT-LLM backend bump to latest version + misc fixes by @mfuntowicz in #2791
- Feat/trtllm cancellation dev container by @Hugoch in #2795
- New arg. by @Narsil in #2845
- Fixing CI. by @Narsil in #2846
- fix: lint backend and doc files by @drbh in #2850
- Qwen2-VL runtime error fix when prompted with multiple images by @janne-alatalo in #2840
- Update vllm kernels for ROCM by @mht-sharma in #2826
- change xpu lib download link by @sywangyi in #2852
- fix: include add_special_tokens in kserve request by @drbh in #2859
- chore: fixed some typos and attribute issues in README by @ruidazeng in #2891
- update ipex xpu to fix issue in ARC770 by @sywangyi in #2884
- Basic flashinfer 0.2 support by @danieldk in #2862
- Improve vlm support (add idefics3 support) by @drbh in #2437
- Update to marlin-kernels 0.3.7 by @danieldk in #2882
- chore: Update jsonschema to 0.28.0 by @Stranger6667 in #2870
- Add possible variants for A100 and H100 GPUs for auto-detecting flops by @lazariv in #2837
- Update using_guidance.md by @nbroad1881 in #2901
- fix crash in torch2.6 if TP=1 by @sywangyi in #2885
- Add Flash decoding kernel ROCm by @mht-sharma in #2855
- Enable FP8 Per-Tensor Scales and Integrate Marlin/MoE Kernels Repo for ROCm by @mht-sharma in #2825
- Baichuan2-13B does not have max_position_embeddings in config by @sywangyi in #2903
- docs(conceptual/speculation): available links Train Medusa by @guspan-tanadi in #2863
- Fix `docker run` in `README.md` by @alvarobartt in #2861
- Add guide on using TPU with TGI in the docs by @baptistecolle in #2907
- Upgrading our rustc version. by @Narsil in #2908
- Fix typo in TPU docs by @baptistecolle in #2911
- Removing the github runner. by @Narsil in #2912
- Upgrading bitsandbytes. by @Narsil in #2910
- Do not convert weight scale to e4m3fnuz on CUDA by @danieldk in #2917
- feat: improve star coder to support multi lora layers by @drbh in #2883
- Flash decoding kernel adding and prefill-chunking and prefix caching enabling in intel cpu/xpu by @sywangyi in #2815
- nix: update to PyTorch 2.5.1 by @danieldk in #2921
- Moving to `uv` instead of `poetry`. by @Narsil in #2919
- Add fp8 kv cache for ROCm by @mht-sharma in #2856
- fix the crash of meta-llama/Llama-3.2-1B by @sywangyi in #2918
- feat: improve qwen2-vl startup by @drbh in #2802
- Revert "feat: improve qwen2-vl startup " by @drbh in #2924
- flashinfer: switch to plan API by @danieldk in #2904
- Fixing TRTLLM dockerfile. by @Narsil in #2922
- Flash Transformers modeling backend support by @Cyrilvallez in #2913
- Give TensorRT-LLM a proper CI/CD by @mfuntowicz in #2886
- Trying to avoid the random timeout. by @Narsil in #2929
- Run `pre-commit run --all-files` to fix CI by @alvarobartt in #2933
- Upgrading the deps to have transformers==4.48.0 necessary by @Narsil in #2937
- fix moe in quantization path by @sywangyi in #2935
- Clarify FP8-Marlin use on capability 8.9 by @danieldk in #2940
- Bump TensorRT-LLM backend dependency to v0.16.0 by @mfuntowicz in #2931
- Set `alias` for `max_completion_tokens` in `ChatRequest` by @alvarobartt in #2932
- Add NVIDIA A40 to known cards by @kldzj in #2941
- [TRTLLM] Expose finish reason by @mfuntowicz in #2841
- Tmp tp transformers by @Narsil in #2942
- Transformers backend TP fix by @Cyrilvallez in #2945
- Trying to put back the archlist (to fix the oom). by @Narsil in #2947
New Contributors
- @janne-alatalo made their first contribution in #2840
- @ruidazeng made their first contribution in #2891
- @Stranger6667 made their first contribution in #2870
- @lazariv made their first contribution in #2837
- @baptistecolle made their first contribution in #2907
- @Cyrilvallez made their first contribution in #2913
- @kldzj made their first contribution in #2941
Full Changelog: v3.0.1...v3.0.2