Releases: huggingface/transformers.js
3.4.1
What's new?
- Add support for SNAC (Multi-Scale Neural Audio Codec) in #1251
- Add support for Metric3D (v1 & v2) in #1254
- Add support for Gemma 3 text in #1229 (a usage sketch follows this list). Note: Only Node.js execution is supported for now.
- Safeguard against background removal pipeline precision issues in #1255. Thanks to @LuSrodri for reporting the issue!
- Allow RawImage to read from all types of supported sources by @BritishWerewolf in #1244
- Update pipelines.md api docs in #1256
- Update extension example to use latest version by @fs-eire in #1213
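As a quick illustration of the Gemma 3 text support above, here is a minimal Node.js sketch using the text-generation pipeline. The model ID and dtype are assumptions; substitute a Gemma 3 ONNX checkpoint from the Hub.

import { pipeline } from "@huggingface/transformers";

// Hypothetical checkpoint ID; pick an actual Gemma 3 ONNX model from the Hub
const generator = await pipeline("text-generation", "onnx-community/gemma-3-1b-it-ONNX", {
  dtype: "q4", // assumed quantization; other dtypes may be available
});

const messages = [
  { role: "user", content: "Write me a short poem about the sea." },
];
const output = await generator(messages, { max_new_tokens: 128 });
console.log(output[0].generated_text.at(-1).content);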
Full Changelog: 3.4.0...3.4.1
3.4.0
🚀 Transformers.js v3.4 — Background Removal Pipeline, Ultravox, DAC, Mimi, SmolVLM2, LiteWhisper.
- 🖼️ Background Removal Pipeline
- 🤖 New models: Ultravox, DAC, Mimi, SmolVLM2, LiteWhisper
- 🛠️ Other improvements
- 🤗 New contributors
🖼️ New Background Removal Pipeline
Removing backgrounds from images is now as easy as:
import { pipeline } from "@huggingface/transformers";
const segmenter = await pipeline("background-removal", "onnx-community/BEN2-ONNX");
const output = await segmenter("input.png");
output[0].save("output.png"); // (Optional) Save the image
You can find the full list of compatible models here, which will continue to grow in the future! 🔥 For more information, check out #1216.
🤖 New models
- Ultravox for audio-text-to-text generation (#1207). See here for the list of supported models.
See example usage
import { UltravoxProcessor, UltravoxModel, read_audio } from "@huggingface/transformers";

const processor = await UltravoxProcessor.from_pretrained(
  "onnx-community/ultravox-v0_5-llama-3_2-1b-ONNX",
);
const model = await UltravoxModel.from_pretrained(
  "onnx-community/ultravox-v0_5-llama-3_2-1b-ONNX",
  {
    dtype: {
      embed_tokens: "q8", // "fp32", "fp16", "q8"
      audio_encoder: "q4", // "fp32", "fp16", "q8", "q4", "q4f16"
      decoder_model_merged: "q4", // "q8", "q4", "q4f16"
    },
  },
);

const audio = await read_audio("http://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/mlk.wav", 16000);
const messages = [
  {
    role: "system",
    content: "You are a helpful assistant.",
  },
  { role: "user", content: "Transcribe this audio:<|audio|>" },
];
const text = processor.tokenizer.apply_chat_template(messages, {
  add_generation_prompt: true,
  tokenize: false,
});

const inputs = await processor(text, audio);
const generated_ids = await model.generate({
  ...inputs,
  max_new_tokens: 128,
});

const generated_texts = processor.batch_decode(
  generated_ids.slice(null, [inputs.input_ids.dims.at(-1), null]),
  { skip_special_tokens: true },
);
console.log(generated_texts[0]);
// "I can transcribe the audio for you. Here's the transcription:\n\n\"I have a dream that one day this nation will rise up and live out the true meaning of its creed.\"\n\n- Martin Luther King Jr.\n\nWould you like me to provide the transcription in a specific format (e.g., word-for-word, character-for-character, or a specific font)?"
- DAC and Mimi for audio tokenization/neural audio codecs (#1215). See here for the list of supported DAC models and here for the list of supported Mimi models.
See example usage
DAC:
import { DacModel, AutoFeatureExtractor } from '@huggingface/transformers';

const model_id = "onnx-community/dac_16khz-ONNX";
const model = await DacModel.from_pretrained(model_id);
const feature_extractor = await AutoFeatureExtractor.from_pretrained(model_id);

const audio_sample = new Float32Array(12000);

// pre-process the inputs
const inputs = await feature_extractor(audio_sample);
{
  // explicitly encode then decode the audio inputs
  const encoder_outputs = await model.encode(inputs);
  const { audio_values } = await model.decode(encoder_outputs);
  console.log(audio_values);
}
{
  // or the equivalent with a forward pass
  const { audio_values } = await model(inputs);
  console.log(audio_values);
}
Mimi:
import { MimiModel, AutoFeatureExtractor } from '@huggingface/transformers';

const model_id = "onnx-community/kyutai-mimi-ONNX";
const model = await MimiModel.from_pretrained(model_id);
const feature_extractor = await AutoFeatureExtractor.from_pretrained(model_id);

const audio_sample = new Float32Array(12000);

// pre-process the inputs
const inputs = await feature_extractor(audio_sample);
{
  // explicitly encode then decode the audio inputs
  const encoder_outputs = await model.encode(inputs);
  const { audio_values } = await model.decode(encoder_outputs);
  console.log(audio_values);
}
{
  // or the equivalent with a forward pass
  const { audio_values } = await model(inputs);
  console.log(audio_values);
}
- SmolVLM2, a lightweight multimodal model designed to analyze image and video content (#1196). See here for the list of supported models. Usage is identical to SmolVLM.
- LiteWhisper for automatic speech recognition (#1219). See here for the list of supported models. Usage is identical to Whisper; see the sketch below.
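Since LiteWhisper usage mirrors Whisper, the sketch below uses the standard automatic-speech-recognition pipeline. The model ID here is an assumption; use any checkpoint from the linked list of supported models.

import { pipeline } from "@huggingface/transformers";

// Hypothetical model ID; substitute a LiteWhisper checkpoint from the supported list
const transcriber = await pipeline(
  "automatic-speech-recognition",
  "onnx-community/lite-whisper-large-v3-turbo-ONNX",
);

const url = "https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav";
const output = await transcriber(url);
console.log(output.text);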
🛠️ Other improvements
- Add support for multi-chunk external data files in #1212
- Fix package export by @fs-eire in #1161
- Add NFD normalizer in #1211. Thanks to @adewdev for reporting!
- Documentation improvements by @viksit in #1184
- Optimize conversion script in #1204 and #1218
- Use Float16Array instead of Uint16Array for kvcache when available in #1208
🤗 New contributors
- @axrati made their first contribution in #602
- @viksit made their first contribution in #1184
- @tangkunyin made their first contribution in #1203
Full Changelog: 3.3.3...3.4.0
3.3.3
3.3.2
What's new?
- Add support for Helium and Glm in #1156
- Improve build process and fix usage with certain bundlers in #1158
- Auto-detect wordpiece tokenizer when model.type is missing in #1151
- Update Moonshine config values for transformers v4.48.0 in #1155
- Support simultaneous tensor op execution in WASM in #1162
- Update react tutorial sample code in #1152
Full Changelog: 3.3.1...3.3.2
3.3.1
3.3.0
🔥 Transformers.js v3.3 — StyleTTS 2 (Kokoro) for state-of-the-art text-to-speech, Grounding DINO for zero-shot object detection
🤖 New models: StyleTTS 2, Grounding DINO
StyleTTS 2 for high-quality speech synthesis
See #1148 for more information and here for the list of supported models.
First, install the kokoro-js library, which uses Transformers.js, from NPM using:
npm i kokoro-js
You can then generate speech as follows:
import { KokoroTTS } from "kokoro-js";
const model_id = "onnx-community/Kokoro-82M-ONNX";
const tts = await KokoroTTS.from_pretrained(model_id, {
dtype: "q8", // Options: "fp32", "fp16", "q8", "q4", "q4f16"
});
const text = "Life is like a box of chocolates. You never know what you're gonna get.";
const audio = await tts.generate(text, {
// Use `tts.list_voices()` to list all available voices
voice: "af_bella",
});
audio.save("audio.wav");
Grounding DINO for zero-shot object detection
See #1137 for more information and here for the list of supported models.
Example: Zero-shot object detection with onnx-community/grounding-dino-tiny-ONNX using the pipeline API.
import { pipeline } from "@huggingface/transformers";
const detector = await pipeline("zero-shot-object-detection", "onnx-community/grounding-dino-tiny-ONNX");
const url = "http://images.cocodataset.org/val2017/000000039769.jpg";
const candidate_labels = ["a cat."];
const output = await detector(url, candidate_labels, {
threshold: 0.3,
});
See example output
[
{ score: 0.45316222310066223, label: "a cat", box: { xmin: 343, ymin: 23, xmax: 637, ymax: 372 } },
{ score: 0.36190420389175415, label: "a cat", box: { xmin: 12, ymin: 52, xmax: 317, ymax: 472 } },
]
🛠️ Other improvements
- Add the RawAudio class by @Th3G33k in #682 (see the sketch after this list)
- Update React guide for v3 by @sroussey in #1128
- Add option to skip special tokens in TextStreamer by @sroussey in #1139
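To give a feel for the new RawAudio class from #682, here is a minimal sketch that wraps raw samples and writes them out as a WAV file. The constructor signature (samples, then sampling rate) is an assumption based on how the class is used elsewhere in the library.

import { RawAudio } from "@huggingface/transformers";

// One second of silence at 16 kHz (constructor signature assumed: samples, sampling_rate)
const samples = new Float32Array(16000);
const audio = new RawAudio(samples, 16000);

// Write the audio to disk as a WAV file
await audio.save("silence.wav");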
🤗 New contributors
Full Changelog: 3.2.4...3.3.0
3.2.4
What's new?
- Add support for visualizing self-attention heatmaps in #1117
Example code
import { AutoProcessor, AutoModelForImageClassification, interpolate_4d, RawImage } from "@huggingface/transformers";

// Load model and processor
const model_id = "onnx-community/dinov2-with-registers-small-with-attentions";
const model = await AutoModelForImageClassification.from_pretrained(model_id);
const processor = await AutoProcessor.from_pretrained(model_id);

// Load image from URL
const image = await RawImage.read("https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/cats.jpg");

// Pre-process image
const inputs = await processor(image);

// Perform inference
const { logits, attentions } = await model(inputs);

// Get the predicted class
const cls = logits[0].argmax().item();
const label = model.config.id2label[cls];
console.log(`Predicted class: ${label}`);

// Set config values
const patch_size = model.config.patch_size;
const [width, height] = inputs.pixel_values.dims.slice(-2);
const w_featmap = Math.floor(width / patch_size);
const h_featmap = Math.floor(height / patch_size);
const num_heads = model.config.num_attention_heads;
const num_cls_tokens = 1;
const num_register_tokens = model.config.num_register_tokens ?? 0;

// Visualize attention maps
const selected_attentions = attentions
  .at(-1) // we are only interested in the attention maps of the last layer
  .slice(0, null, 0, [num_cls_tokens + num_register_tokens, null])
  .view(num_heads, 1, w_featmap, h_featmap);

const upscaled = await interpolate_4d(selected_attentions, {
  size: [width, height],
  mode: "nearest",
});

for (let i = 0; i < num_heads; ++i) {
  const head_attentions = upscaled[i];
  const minval = head_attentions.min().item();
  const maxval = head_attentions.max().item();
  const image = RawImage.fromTensor(
    head_attentions
      .sub_(minval)
      .div_(maxval - minval)
      .mul_(255)
      .to("uint8"),
  );
  await image.save(`attn-head-${i}.png`);
}
- Add min, max, argmin, argmax tensor ops for dim=null (a small example follows this list)
- Add support for nearest-neighbour interpolation in interpolate_4d
- Depth Estimation pipeline improvements (faster & returns resized depth map); a usage sketch also follows this list
- TypeScript improvements by @ocavue and @shrirajh in #1081 and #1122
- Remove unused imports from tokenizers.js by @pratapvardhan in #1116
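As a quick sketch of the new dim=null (whole-tensor) reductions, the example below builds a small Tensor and reads back scalar results with item(); the commented values follow directly from the input data.

import { Tensor } from "@huggingface/transformers";

// A 2x3 tensor of floats
const t = new Tensor("float32", new Float32Array([1, 5, 3, 2, 4, 0]), [2, 3]);

// With no dimension specified, the ops reduce over the whole tensor
console.log(t.min().item());    // 0
console.log(t.max().item());    // 5
console.log(t.argmin().item()); // flattened index of the smallest value (5)
console.log(t.argmax().item()); // flattened index of the largest value (1)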
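And for the Depth Estimation pipeline improvements, a minimal sketch is shown below. The model ID is an assumption; the output destructuring follows the pipeline's existing { predicted_depth, depth } shape, with depth now resized to match the input image.

import { pipeline } from "@huggingface/transformers";

// Model ID is an assumption; any supported depth-estimation checkpoint should work
const depth_estimator = await pipeline("depth-estimation", "onnx-community/depth-anything-v2-small");

const url = "https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/cats.jpg";
const { depth } = await depth_estimator(url);

// The returned depth map is a RawImage, resized to the input image's dimensions
await depth.save("depth.png");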
New Contributors
- @shrirajh made their first contribution in #1122
- @pratapvardhan made their first contribution in #1116
Full Changelog: 3.2.3...3.2.4
3.2.3
What's new?
- Fix setting of model_file_name for image feature extraction pipeline in #1114. Thanks @xitanggg for reporting the issue!
- Add support for dinov2 with registers in #1110. Example usage:
import { pipeline } from '@huggingface/transformers';

// Create image classification pipeline
const classifier = await pipeline('image-classification', 'onnx-community/dinov2-with-registers-small-imagenet1k-1-layer');

// Classify an image
const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/cats.jpg';
const output = await classifier(url);
console.log(output);
// [
//   { label: 'tabby, tabby cat', score: 0.8135351538658142 },
//   { label: 'tiger cat', score: 0.08967583626508713 },
//   { label: 'Egyptian cat', score: 0.06800546497106552 },
//   { label: 'radiator', score: 0.003501888597384095 },
//   { label: 'quilt, comforter, comfort, puff', score: 0.003408448537811637 },
// ]
Full Changelog: 3.2.2...3.2.3
3.2.2
3.2.1
What's new?
- Add support for ModernBERT in #1104. Check out the blog post for more information!
Example:
import { pipeline } from '@huggingface/transformers';

const pipe = await pipeline('fill-mask', 'answerdotai/ModernBERT-base');

const answer = await pipe('The capital of France is [MASK].');
console.log(answer);
Full Changelog: 3.2.0...3.2.1