
3.4.0


🚀 Transformers.js v3.4 — Background Removal Pipeline, Ultravox, DAC, Mimi, SmolVLM2, LiteWhisper.

πŸ–ΌοΈ New Background Removal Pipeline

Removing backgrounds from images is now as easy as:

import { pipeline } from "@huggingface/transformers";
const segmenter = await pipeline("background-removal", "onnx-community/BEN2-ONNX");
const output = await segmenter("input.png");
output[0].save("output.png"); // (Optional) Save the image

You can find the full list of compatible models here, and it will continue to grow in the future! 🔥 For more information, check out #1216.
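
As with other pipelines, you can pass loading options such as device and dtype when constructing it. A minimal sketch, assuming your environment and the model build support the values shown (they are illustrative, not required settings):

import { pipeline } from "@huggingface/transformers";

// Construct the background-removal pipeline with explicit device/dtype options.
// The values below are illustrative; WebGPU availability depends on the browser.
const segmenter = await pipeline("background-removal", "onnx-community/BEN2-ONNX", {
  device: "webgpu", // e.g. "webgpu" or "wasm"
  dtype: "fp32",    // e.g. "fp32", "fp16", "q8"
});
const output = await segmenter("input.png");
output[0].save("output.png");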

🤖 New models

  • Ultravox for audio-text-to-text generation (#1207). See here for the list of supported models.

    See example usage
    import { UltravoxProcessor, UltravoxModel, read_audio } from "@huggingface/transformers";
    
    const processor = await UltravoxProcessor.from_pretrained(
      "onnx-community/ultravox-v0_5-llama-3_2-1b-ONNX",
    );
    const model = await UltravoxModel.from_pretrained(
      "onnx-community/ultravox-v0_5-llama-3_2-1b-ONNX",
      {
        dtype: {
          embed_tokens: "q8", // "fp32", "fp16", "q8"
          audio_encoder: "q4", // "fp32", "fp16", "q8", "q4", "q4f16"
          decoder_model_merged: "q4", // "q8", "q4", "q4f16"
        },
      },
    );
    
    const audio = await read_audio("http://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/mlk.wav", 16000);
    const messages = [
      {
        role: "system",
        content: "You are a helpful assistant.",
      },
      { role: "user", content: "Transcribe this audio:<|audio|>" },
    ];
    const text = processor.tokenizer.apply_chat_template(messages, {
      add_generation_prompt: true,
      tokenize: false,
    });
    
    const inputs = await processor(text, audio);
    const generated_ids = await model.generate({
      ...inputs,
      max_new_tokens: 128,
    });
    
    const generated_texts = processor.batch_decode(
      generated_ids.slice(null, [inputs.input_ids.dims.at(-1), null]),
      { skip_special_tokens: true },
    );
    console.log(generated_texts[0]);
    // "I can transcribe the audio for you. Here's the transcription:\n\n\"I have a dream that one day this nation will rise up and live out the true meaning of its creed.\"\n\n- Martin Luther King Jr.\n\nWould you like me to provide the transcription in a specific format (e.g., word-for-word, character-for-character, or a specific font)?"
  • DAC and Mimi for audio tokenization/neural audio codecs (#1215). See here for the list of supported DAC models and here for the list of supported Mimi models.

    See example usage

    DAC:

    import { DacModel, AutoFeatureExtractor } from '@huggingface/transformers';
    
    const model_id = "onnx-community/dac_16khz-ONNX";
    const model = await DacModel.from_pretrained(model_id);
    const feature_extractor = await AutoFeatureExtractor.from_pretrained(model_id);
    
    const audio_sample = new Float32Array(12000); // dummy (silent) audio sample
    
    // pre-process the inputs
    const inputs = await feature_extractor(audio_sample);
    {
        // explicitly encode then decode the audio inputs
        const encoder_outputs = await model.encode(inputs);
        const { audio_values } = await model.decode(encoder_outputs);
        console.log(audio_values);
    }
    
    {
        // or the equivalent with a forward pass
        const { audio_values } = await model(inputs);
        console.log(audio_values);
    }

    Mimi:

    import { MimiModel, AutoFeatureExtractor } from '@huggingface/transformers';
    
    const model_id = "onnx-community/kyutai-mimi-ONNX";
    const model = await MimiModel.from_pretrained(model_id);
    const feature_extractor = await AutoFeatureExtractor.from_pretrained(model_id);
    
    const audio_sample = new Float32Array(12000); // dummy (silent) audio sample
    
    // pre-process the inputs
    const inputs = await feature_extractor(audio_sample);
    {
        // explicitly encode then decode the audio inputs
        const encoder_outputs = await model.encode(inputs);
        const { audio_values } = await model.decode(encoder_outputs);
        console.log(audio_values);
    }
    
    {
        // or the equivalent with a forward pass
        const { audio_values } = await model(inputs);
        console.log(audio_values);
    }
  • SmolVLM2, a lightweight multimodal model designed to analyze image and video content (#1196). See here for the list of supported models. Usage is identical to SmolVLM (see the first sketch after this list).

  • LiteWhisper for automatic speech recognition (#1219). See here for the list of supported models. Usage is identical to Whisper (see the second sketch after this list).
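
Since SmolVLM2 usage is identical to SmolVLM, a minimal image-description sketch looks like the following. The model ID, dtype values, and image URL below are assumptions for illustration; pick an actual repository from the supported-model list linked above:

import { AutoProcessor, AutoModelForVision2Seq, load_image } from "@huggingface/transformers";

// NOTE: model ID and dtype values are illustrative; substitute a repository
// from the SmolVLM2 supported-model list.
const model_id = "HuggingFaceTB/SmolVLM2-256M-Video-Instruct";
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await AutoModelForVision2Seq.from_pretrained(model_id, {
  dtype: {
    embed_tokens: "fp16",
    vision_encoder: "q4",
    decoder_model_merged: "q4",
  },
});

// Load an input image
const image = await load_image("http://images.cocodataset.org/val2017/000000039769.jpg");

// Build the chat-style prompt
const messages = [
  {
    role: "user",
    content: [
      { type: "image" },
      { type: "text", text: "Can you describe this image?" },
    ],
  },
];
const text = processor.apply_chat_template(messages, { add_generation_prompt: true });

// Prepare inputs and generate
const inputs = await processor(text, [image]);
const generated_ids = await model.generate({ ...inputs, max_new_tokens: 128 });
const generated_texts = processor.batch_decode(
  generated_ids.slice(null, [inputs.input_ids.dims.at(-1), null]),
  { skip_special_tokens: true },
);
console.log(generated_texts[0]);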
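
And since LiteWhisper usage is identical to Whisper, it can be dropped straight into the automatic-speech-recognition pipeline. A minimal sketch, where the model ID is an assumption (substitute one from the supported-model list linked above):

import { pipeline } from "@huggingface/transformers";

// The model ID below is illustrative; use a repository from the LiteWhisper list above.
const transcriber = await pipeline(
  "automatic-speech-recognition",
  "onnx-community/lite-whisper-large-v3-turbo-ONNX",
);
const output = await transcriber("https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav");
console.log(output);
// e.g. { text: " And so my fellow Americans, ..." }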

πŸ› οΈ Other improvements

  • Add support for multi-chunk external data files in #1212
  • Fix package export by @fs-eire in #1161
  • Add NFD normalizer in #1211. Thanks to @adewdev for reporting!
  • Documentation improvements by @viksit in #1184
  • Optimize conversion script in #1204 and #1218
  • Use Float16Array instead of Uint16Array for the KV cache when available in #1208
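
For context on the last item, here is a minimal sketch of the kind of feature detection involved (an illustration, not the library's actual code): use the native Float16Array where the JS runtime provides it, otherwise fall back to Uint16Array.

// Prefer the native Float16Array (a relatively new typed array) for half-precision
// KV-cache buffers, falling back to Uint16Array holding the raw fp16 bit patterns.
const Fp16Storage = typeof Float16Array !== "undefined" ? Float16Array : Uint16Array;
const kvCache = new Fp16Storage(2048); // illustrative buffer size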

🤗 New contributors

Full Changelog: 3.3.3...3.4.0