
Releases: huggingface/transformers.js

3.4.1

25 Mar 22:30
39a75ce

What's new?

  • Add support for SNAC (Multi-Scale Neural Audio Codec) in #1251
  • Add support for Metric3D (v1 & v2) in #1254
  • Add support for Gemma 3 text in #1229. Note: Only Node.js execution is supported for now (see the sketch after this list).
  • Safeguard against background removal pipeline precision issues in #1255. Thanks to @LuSrodri for reporting the issue!
  • Allow RawImage to read from all types of supported sources by @BritishWerewolf in #1244
  • Update pipelines.md api docs in #1256
  • Update extension example to use latest version by @fs-eire in #1213
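
For reference, here is a minimal sketch of running Gemma 3 text generation in Node.js via the text-generation pipeline. The model id and dtype below are illustrative assumptions; pick any Gemma 3 ONNX checkpoint from the Hub.

import { pipeline } from "@huggingface/transformers";

// NOTE: model id and dtype are illustrative assumptions, not part of this release note.
const generator = await pipeline("text-generation", "onnx-community/gemma-3-1b-it-ONNX", {
  dtype: "q4",
});

const messages = [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Write a haiku about the ocean." },
];
const output = await generator(messages, { max_new_tokens: 128 });
console.log(output[0].generated_text.at(-1).content);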

Full Changelog: 3.4.0...3.4.1

3.4.0

07 Mar 12:04
5b5e5ed

🚀 Transformers.js v3.4 — Background Removal Pipeline, Ultravox, DAC, Mimi, SmolVLM2, LiteWhisper

🖼️ New Background Removal Pipeline

Removing backgrounds from images is now as easy as:

import { pipeline } from "@huggingface/transformers";
const segmenter = await pipeline("background-removal", "onnx-community/BEN2-ONNX");
const output = await segmenter("input.png");
output[0].save("output.png"); // (Optional) Save the image

You can find the full list of compatible models here; the list will continue to grow in the future! 🔥 For more information, check out #1216.

🤖 New models

  • Ultravox for audio-text-to-text generation (#1207). See here for the list of supported models.

    See example usage
    import { UltravoxProcessor, UltravoxModel, read_audio } from "@huggingface/transformers";
    
    const processor = await UltravoxProcessor.from_pretrained(
      "onnx-community/ultravox-v0_5-llama-3_2-1b-ONNX",
    );
    const model = await UltravoxModel.from_pretrained(
      "onnx-community/ultravox-v0_5-llama-3_2-1b-ONNX",
      {
        dtype: {
          embed_tokens: "q8", // "fp32", "fp16", "q8"
          audio_encoder: "q4", // "fp32", "fp16", "q8", "q4", "q4f16"
          decoder_model_merged: "q4", // "q8", "q4", "q4f16"
        },
      },
    );
    
    const audio = await read_audio("http://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/mlk.wav", 16000);
    const messages = [
      {
        role: "system",
        content: "You are a helpful assistant.",
      },
      { role: "user", content: "Transcribe this audio:<|audio|>" },
    ];
    const text = processor.tokenizer.apply_chat_template(messages, {
      add_generation_prompt: true,
      tokenize: false,
    });
    
    const inputs = await processor(text, audio);
    const generated_ids = await model.generate({
      ...inputs,
      max_new_tokens: 128,
    });
    
    const generated_texts = processor.batch_decode(
      generated_ids.slice(null, [inputs.input_ids.dims.at(-1), null]),
      { skip_special_tokens: true },
    );
    console.log(generated_texts[0]);
    // "I can transcribe the audio for you. Here's the transcription:\n\n\"I have a dream that one day this nation will rise up and live out the true meaning of its creed.\"\n\n- Martin Luther King Jr.\n\nWould you like me to provide the transcription in a specific format (e.g., word-for-word, character-for-character, or a specific font)?"
  • DAC and Mimi for audio tokenization/neural audio codecs (#1215). See here for the list of supported DAC models and here for the list of supported Mimi models.

    See example usage

    DAC:

    import { DacModel, AutoFeatureExtractor } from '@huggingface/transformers';
    
    const model_id = "onnx-community/dac_16khz-ONNX";
    const model = await DacModel.from_pretrained(model_id);
    const feature_extractor = await AutoFeatureExtractor.from_pretrained(model_id);
    
    const audio_sample = new Float32Array(12000);
    
    // pre-process the inputs
    const inputs = await feature_extractor(audio_sample);
    {
        // explicitly encode then decode the audio inputs
        const encoder_outputs = await model.encode(inputs);
        const { audio_values } = await model.decode(encoder_outputs);
        console.log(audio_values);
    }
    
    {
        // or the equivalent with a forward pass
        const { audio_values } = await model(inputs);
        console.log(audio_values);
    }

    Mimi:

    import { MimiModel, AutoFeatureExtractor } from '@huggingface/transformers';
    
    const model_id = "onnx-community/kyutai-mimi-ONNX";
    const model = await MimiModel.from_pretrained(model_id);
    const feature_extractor = await AutoFeatureExtractor.from_pretrained(model_id);
    
    const audio_sample = new Float32Array(12000);
    
    // pre-process the inputs
    const inputs = await feature_extractor(audio_sample);
    {
        // explicitly encode then decode the audio inputs
        const encoder_outputs = await model.encode(inputs);
        const { audio_values } = await model.decode(encoder_outputs);
        console.log(audio_values);
    }
    
    {
        // or the equivalent with a forward pass
        const { audio_values } = await model(inputs);
        console.log(audio_values);
    }
  • SmolVLM2, a lightweight multimodal model designed to analyze image and video content (#1196). See here for the list of supported models. Usage is identical to SmolVLM.

  • LiteWhisper for automatic speech recognition (#1219). See here for the list of supported models. Usage is identical to Whisper; a minimal sketch follows below.
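
As a rough usage sketch for LiteWhisper (the model id below is an assumption; see the supported-model list for actual checkpoints), the automatic-speech-recognition pipeline works exactly as it does for Whisper:

import { pipeline } from "@huggingface/transformers";

// NOTE: the model id is an illustrative assumption; pick a LiteWhisper checkpoint from the Hub.
const transcriber = await pipeline(
  "automatic-speech-recognition",
  "onnx-community/lite-whisper-large-v3-turbo-ONNX",
);

const url = "https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/mlk.wav";
const { text } = await transcriber(url);
console.log(text);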

🛠️ Other improvements

  • Add support for multi-chunk external data files in #1212
  • Fix package export by @fs-eire in #1161
  • Add NFD normalizer in #1211. Thanks to @adewdev for reporting!
  • Documentation improvements by @viksit in #1184
  • Optimize conversion script in #1204 and #1218
  • Use Float16Array instead of Uint16Array for kvcache when available in #1208


Full Changelog: 3.3.3...3.4.0

3.3.3

06 Feb 23:33
829ace0

What's new?

  • Bump onnxruntime-web and @huggingface/jinja in #1183.

Full Changelog: 3.3.2...3.3.3

3.3.2

22 Jan 15:13
6f43f24

What's new?

  • Add support for Helium and Glm in #1156 (see the sketch after this list)
  • Improve build process and fix usage with certain bundlers in #1158
  • Auto-detect wordpiece tokenizer when model.type is missing in #1151
  • Update Moonshine config values for transformers v4.48.0 in #1155
  • Support simultaneous tensor op execution in WASM in #1162
  • Update react tutorial sample code in #1152
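
For illustration, a hedged sketch of running one of the newly supported architectures through the text-generation pipeline (the model id is an assumption; check the Hub for actual Helium/Glm ONNX checkpoints):

import { pipeline } from "@huggingface/transformers";

// NOTE: the model id is an illustrative assumption.
const generator = await pipeline("text-generation", "onnx-community/helium-1-preview-2b-ONNX");

const output = await generator("The capital of France is", { max_new_tokens: 20 });
console.log(output[0].generated_text);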

Full Changelog: 3.3.1...3.3.2

3.3.1

15 Jan 15:36
e1753ac

What's new?

  • hotfix: Copy missing ort-wasm-simd-threaded.jsep.mjs to dist folder (#1150)

Full Changelog: 3.3.0...3.3.1

3.3.0

15 Jan 13:28
e00ff3b

🔥 Transformers.js v3.3 — StyleTTS 2 (Kokoro) for state-of-the-art text-to-speech, Grounding DINO for zero-shot object detection

🤖 New models: StyleTTS 2, Grounding DINO

StyleTTS 2 for high-quality speech synthesis

See #1148 for more information and here for the list of supported models.

First, install the kokoro-js library (which uses Transformers.js internally) from NPM:

npm i kokoro-js

You can then generate speech as follows:

import { KokoroTTS } from "kokoro-js";

const model_id = "onnx-community/Kokoro-82M-ONNX";
const tts = await KokoroTTS.from_pretrained(model_id, {
  dtype: "q8", // Options: "fp32", "fp16", "q8", "q4", "q4f16"
});

const text = "Life is like a box of chocolates. You never know what you're gonna get.";
const audio = await tts.generate(text, {
  // Use `tts.list_voices()` to list all available voices
  voice: "af_bella",
});
audio.save("audio.wav");

Grounding DINO for zero-shot object detection

See #1137 for more information and here for the list of supported models.

Example: Zero-shot object detection with onnx-community/grounding-dino-tiny-ONNX using the pipeline API.

import { pipeline } from "@huggingface/transformers";

const detector = await pipeline("zero-shot-object-detection", "onnx-community/grounding-dino-tiny-ONNX");

const url = "http://images.cocodataset.org/val2017/000000039769.jpg";
const candidate_labels = ["a cat."];
const output = await detector(url, candidate_labels, {
  threshold: 0.3,
});
See example output
[
  { score: 0.45316222310066223, label: "a cat", box: { xmin: 343, ymin: 23, xmax: 637, ymax: 372 } },
  { score: 0.36190420389175415, label: "a cat", box: { xmin: 12, ymin: 52, xmax: 317, ymax: 472 } },
]


Full Changelog: 3.2.4...3.3.0

3.2.4

28 Dec 12:03
307a490

What's new?

  • Add support for visualizing self-attention heatmaps in #1117

    (Figure: per-head attention heatmaps, heads 0–5, overlaid on the example cats image.)
    Example code
    import { AutoProcessor, AutoModelForImageClassification, interpolate_4d, RawImage } from "@huggingface/transformers";
    
    // Load model and processor
    const model_id = "onnx-community/dinov2-with-registers-small-with-attentions";
    const model = await AutoModelForImageClassification.from_pretrained(model_id);
    const processor = await AutoProcessor.from_pretrained(model_id);
    
    // Load image from URL
    const image = await RawImage.read("https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/cats.jpg");
    
    // Pre-process image
    const inputs = await processor(image);
    
    // Perform inference
    const { logits, attentions } = await model(inputs);
    
    // Get the predicted class
    const cls = logits[0].argmax().item();
    const label = model.config.id2label[cls];
    console.log(`Predicted class: ${label}`);
    
    // Set config values
    const patch_size = model.config.patch_size;
    const [width, height] = inputs.pixel_values.dims.slice(-2);
    const w_featmap = Math.floor(width / patch_size);
    const h_featmap = Math.floor(height / patch_size);
    const num_heads = model.config.num_attention_heads;
    const num_cls_tokens = 1;
    const num_register_tokens = model.config.num_register_tokens ?? 0;
    
    // Visualize attention maps
    const selected_attentions = attentions
        .at(-1) // we are only interested in the attention maps of the last layer
        .slice(0, null, 0, [num_cls_tokens + num_register_tokens, null])
        .view(num_heads, 1, w_featmap, h_featmap);
    
    const upscaled = await interpolate_4d(selected_attentions, {
        size: [width, height],
        mode: "nearest",
    });
    
    for (let i = 0; i < num_heads; ++i) {
        const head_attentions = upscaled[i];
        const minval = head_attentions.min().item();
        const maxval = head_attentions.max().item();
        const image = RawImage.fromTensor(
            head_attentions
                .sub_(minval)
                .div_(maxval - minval)
                .mul_(255)
                .to("uint8"),
        );
        await image.save(`attn-head-${i}.png`);
    }
  • Add min, max, argmin, argmax tensor ops for dim=null

  • Add support for nearest-neighbour interpolation in interpolate_4d

  • Depth Estimation pipeline improvements (faster & returns resized depth map); a usage sketch follows this list

  • TypeScript improvements by @ocavue and @shrirajh in #1081 and #1122

  • Remove unused imports from tokenizers.js by @pratapvardhan in #1116
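
To illustrate the depth-estimation pipeline improvements, here is a minimal sketch (the model id is an assumption; any supported depth-estimation checkpoint should work):

import { pipeline } from "@huggingface/transformers";

// NOTE: the model id is an illustrative assumption.
const depth_estimator = await pipeline("depth-estimation", "onnx-community/depth-anything-v2-small");

const url = "https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/cats.jpg";
const { depth } = await depth_estimator(url);
await depth.save("depth.png"); // the returned depth map is resized to the input image's dimensions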


Full Changelog: 3.2.3...3.2.4

3.2.3

25 Dec 10:41
8e075f4

What's new?

  • Fix setting of model_file_name for image feature extraction pipeline in #1114. Thanks @xitanggg for reporting the issue!
  • Add support for dinov2 with registers in #1110. Example usage:
    import { pipeline } from '@huggingface/transformers';
    
    // Create image classification pipeline
    const classifier = await pipeline('image-classification', 'onnx-community/dinov2-with-registers-small-imagenet1k-1-layer');
    
    // Classify an image
    const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/cats.jpg';
    const output = await classifier(url);
    console.log(output);
    // [
    //   { label: 'tabby, tabby cat', score: 0.8135351538658142 },
    //   { label: 'tiger cat', score: 0.08967583626508713 },
    //   { label: 'Egyptian cat', score: 0.06800546497106552 },
    //   { label: 'radiator', score: 0.003501888597384095 },
    //   { label: 'quilt, comforter, comfort, puff', score: 0.003408448537811637 },
    // ]

Full Changelog: 3.2.2...3.2.3

3.2.2

23 Dec 15:05
da2c1e9

What's new?

  • Fix env.backends.onnx.wasm.proxy = true: Clone tensor if using onnx wasm proxy in #1108
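
For context, the proxy flag in question is set on the env object before constructing a pipeline; a minimal sketch (the model id is just an example):

import { env, pipeline } from "@huggingface/transformers";

// Run the WASM backend through a proxy worker thread.
env.backends.onnx.wasm.proxy = true;

// Example model id; any pipeline/model combination applies.
const classifier = await pipeline("image-classification", "Xenova/vit-base-patch16-224");
const output = await classifier("https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/cats.jpg");
console.log(output);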

Full Changelog: 3.2.1...3.2.2

3.2.1

19 Dec 17:02
074e97a

What's new?

  • Add support for ModernBert in #1104. Check out the blog post for more information!

    Example:

    import { pipeline } from '@huggingface/transformers';
    
    const pipe = await pipeline('fill-mask', 'answerdotai/ModernBERT-base');
    const answer = await pipe('The capital of France is [MASK].');
    console.log(answer);


Full Changelog: 3.2.0...3.2.1