
Releases: huggingface/transformers.js

2.13.2

03 Jan 14:57

What's new?

This release is a follow-up to #485, with additional intellisense-focused improvements (see PR).
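
For instance, here's a rough sketch of the kind of editor support these typings enable (the task and model below are purely illustrative): hovering over `classifier` in an editor that picks up the bundled type declarations should show the concrete pipeline class (e.g. TextClassificationPipeline) together with its documented example usage, rather than a generic Pipeline.

import { pipeline } from '@xenova/transformers';

// With the improved typings, the task string narrows the return type, so editors
// can surface the specific pipeline class and its JSDoc examples on hover.
const classifier = await pipeline('text-classification', 'Xenova/distilbert-base-uncased-finetuned-sst-2-english');

const output = await classifier('I love transformers!');
console.log(output); // e.g. [{ label: 'POSITIVE', score: 0.99... }]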


Full Changelog: 2.13.1...2.13.2

2.13.1

03 Jan 11:24

What's new?

  • Improve typing of pipeline function in #485. Thanks to @wesbos for the suggestion!


    This also means when you hover over the class name, you'll get example code to help you out.

  • Add phi-1_5 model in #493.

    See example code
    import { pipeline } from '@xenova/transformers';
    
    // Create a text-generation pipeline
    const generator = await pipeline('text-generation', 'Xenova/phi-1_5_dev');
    
    // Construct prompt
    const prompt = `\`\`\`py
    import math
    def print_prime(n):
        """
        Print all primes between 1 and n
        """`;
    
    // Generate text
    const result = await generator(prompt, {
      max_new_tokens: 100,
    });
    console.log(result[0].generated_text);

    Results in:

    import math
    def print_prime(n):
        """
        Print all primes between 1 and n
        """
        primes = []
        for num in range(2, n+1):
            is_prime = True
            for i in range(2, int(math.sqrt(num))+1):
                if num % i == 0:
                    is_prime = False
                    break
            if is_prime:
                primes.append(num)
        print(primes)
    
    print_prime(20)

    Running the code produces the correct result:

    [2, 3, 5, 7, 11, 13, 17, 19]
    

Full Changelog: 2.13.0...2.13.1

2.13.0

27 Dec 15:00

What's new?

🎄 7 new architectures!

This release adds support for many new multimodal architectures, bringing the total number of supported architectures to 80! 🤯

1. VITS for multilingual text-to-speech across over 1000 languages! (#466)

import { pipeline } from '@xenova/transformers';

// Create English text-to-speech pipeline
const synthesizer = await pipeline('text-to-speech', 'Xenova/mms-tts-eng');

// Generate speech
const output = await synthesizer('I love transformers');
// {
//   audio: Float32Array(26112) [...],
//   sampling_rate: 16000
// }

See here for the list of available models. To start, we've converted 12 of the ~1140 models on the Hugging Face Hub. If we haven't added the one you wish to use, you can make it web-ready using our conversion script.
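
To go a step further, here's a hedged sketch of generating speech in another language and saving it to a WAV file in Node.js. It assumes the French checkpoint Xenova/mms-tts-fra is among the converted models (check the linked list) and uses the third-party wavefile package to serialize the Float32Array output:

import { pipeline } from '@xenova/transformers';
import wavefile from 'wavefile';
import fs from 'fs';

// Create a French text-to-speech pipeline (assumes this checkpoint has been converted)
const synthesizer = await pipeline('text-to-speech', 'Xenova/mms-tts-fra');

// Generate speech
const output = await synthesizer('Bonjour tout le monde');

// Write the raw Float32Array samples to a 32-bit float WAV file
const wav = new wavefile.WaveFile();
wav.fromScratch(1, output.sampling_rate, '32f', output.audio);
fs.writeFileSync('bonjour.wav', wav.toBuffer());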

2. CLIPSeg for zero-shot image segmentation. (#478)

import { AutoTokenizer, AutoProcessor, CLIPSegForImageSegmentation, RawImage } from '@xenova/transformers';

// Load tokenizer, processor, and model
const tokenizer = await AutoTokenizer.from_pretrained('Xenova/clipseg-rd64-refined');
const processor = await AutoProcessor.from_pretrained('Xenova/clipseg-rd64-refined');
const model = await CLIPSegForImageSegmentation.from_pretrained('Xenova/clipseg-rd64-refined');

// Run tokenization
const texts = ['a glass', 'something to fill', 'wood', 'a jar'];
const text_inputs = tokenizer(texts, { padding: true, truncation: true });

// Read image and run processor
const image = await RawImage.read('https://github.com/timojl/clipseg/blob/master/example_image.jpg?raw=true');
const image_inputs = await processor(image);

// Run model with both text and pixel inputs
const { logits } = await model({ ...text_inputs, ...image_inputs });
// logits: Tensor {
//   dims: [4, 352, 352],
//   type: 'float32',
//   data: Float32Array(495616)[ ... ],
//   size: 495616
// }

You can visualize the predictions as follows:

const preds = logits
  .unsqueeze_(1)
  .sigmoid_()
  .mul_(255)
  .round_()
  .to('uint8');

for (let i = 0; i < preds.dims[0]; ++i) {
  const img = RawImage.fromTensor(preds[i]);
  img.save(`prediction_${i}.png`);
}
Original "a glass" "something to fill" "wood" "a jar"
image prediction_0 prediction_1 prediction_2 prediction_3

See here for the list of available models.

3. SegFormer for semantic segmentation and image classification. (#480)

import { pipeline } from '@xenova/transformers';

// Create an image segmentation pipeline
const segmenter = await pipeline('image-segmentation', 'Xenova/segformer_b2_clothes');

// Segment an image
const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/young-man-standing-and-leaning-on-car.jpg';
const output = await segmenter(url);


See output
[
  {
    score: null,
    label: 'Background',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  },
  {
    score: null,
    label: 'Hair',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  },
  {
    score: null,
    label: 'Upper-clothes',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  },
  {
    score: null,
    label: 'Pants',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  },
  {
    score: null,
    label: 'Left-shoe',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  },
  {
    score: null,
    label: 'Right-shoe',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  },
  {
    score: null,
    label: 'Face',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  },
  {
    score: null,
    label: 'Left-leg',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  },
  {
    score: null,
    label: 'Right-leg',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  },
  {
    score: null,
    label: 'Left-arm',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  },
  {
    score: null,
    label: 'Right-arm',
    mask: RawImage {
      data: [Uint8ClampedArray],
      width: 970,
      height: 1455,
      channels: 1
    }
  }
]

See here for the list of available models.

4. Table Transformer for table extraction from unstructured documents. (#477)

import { pipeline } from '@xenova/transformers';

// Create an object detection pipeline
const detector = await pipeline('object-detection', 'Xenova/table-transformer-detection', { quantized: false });

// Detect tables in an image
const img = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/invoice-with-table.png';
const output = await detector(img);
// [{ score: 0.9967531561851501, label: 'table', box: { xmin: 52, ymin: 322, xmax: 546, ymax: 525 } }]

See here for the list of available models.
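
Beyond detecting where tables are, the same object-detection pipeline can be pointed at a structure-recognition checkpoint to locate rows and columns. A hedged sketch, assuming Xenova/table-transformer-structure-recognition appears in the linked model list (for best results you would first crop the image to a detected table):

import { pipeline } from '@xenova/transformers';

// Create an object detection pipeline for table structure recognition
const structure_detector = await pipeline('object-detection', 'Xenova/table-transformer-structure-recognition', { quantized: false });

// Detect rows and columns (ideally on an image cropped to a single table)
const img = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/invoice-with-table.png';
const cells = await structure_detector(img, { threshold: 0.5 });
// e.g. objects labelled 'table row', 'table column', etc., each with a score and bounding box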

5. DiT for document image classification. (#474)

import { pipeline } from '@xenova/transformers';

// Create an image classification pipeline
const classifier = await pipeline('image-classification', 'Xenova/dit-base-finetuned-rvlcdip');

// Classify an image 
const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/coca_cola_advertisement.png';
const output = await classifier(url);
// [{ label: 'advertisement', score: 0.9035086035728455 }]

See here for the list of available models.

6. SigLIP for zero-shot image classification. (#473)

import { pipeline } from '@xenova/transformers';

// Create a zero-shot image classification pipeline
const classifier = await pipeline('zero-shot-image-classification', 'Xenova/siglip-base-patch16-224');

// Classify images according to provided labels
const url = 'http://images.cocodataset.org/val2017/000000039769.jpg';
const output = await classifier(url, ['2 cats', '2 dogs'], {
    hypothesis_template: 'a photo of {}',
});
// [
//   { score: 0.16770583391189575, label: '2 cats' },
//   { score: 0.000022096000975579955, label: '2 dogs' }
// ]

See here for the list of available models.

7. RoFormer for masked language modelling, sequence classification, token classification, and question answering. (#464)

import { pipeline } from '@xenova/transformers';

// Create a masked language modelling pipeline
const pipe = await pipeline('fill-mask', 'Xenova/antiberta2');

// Predict missing token
const output = await pipe('Ḣ Q V Q ... C A [MASK] D ... T V S S');
See output
[
  {
    score: 0.48774364590644836,
    token: 19,
    token_str: 'R',
    sequence: 'Ḣ Q V Q C A R D T V S S'
  },
  {
    score: 0.2768442928791046,
    token: 18,
    token_str: 'Q...

2.12.1

18 Dec 21:30

What's new?

Patch for release 2.12.0, making @huggingface/jinja a dependency instead of a peer dependency. This also means apply_chat_template is now synchronous (and does not lazily load the module). We may reintroduce lazy loading in the future, but for now it causes issues when loading from a CDN.

import { AutoTokenizer } from "@xenova/transformers";

// Load tokenizer from the Hugging Face Hub
const tokenizer = await AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1");

// Define chat messages
const chat = [
  { role: "user", content: "Hello, how are you?" },
  { role: "assistant", content: "I'm doing great. How can I help you today?" },
  { role: "user", content: "I'd like to show off how chat templating works!" },
]

const text = tokenizer.apply_chat_template(chat, { tokenize: false });
// "<s>[INST] Hello, how are you? [/INST]I'm doing great. How can I help you today?</s> [INST] I'd like to show off how chat templating works! [/INST]"

const input_ids = tokenizer.apply_chat_template(chat, { tokenize: true, return_tensor: false });
// [1, 733, 16289, 28793, 22557, 28725, 910, 460, 368, 28804, 733, 28748, 16289, 28793, ...]

Full Changelog: 2.12.0...2.12.1

2.12.0

18 Dec 16:13

What's new?

💬 Chat templates!

This release adds support for chat templates, a highly requested feature that enables users to convert conversations (represented as a list of chat objects) into a single tokenizable string, in the format that the model expects. As you may know, chat templates can vary greatly across model types, so it was important to design a system that (1) supports complex chat templates, (2) is generalizable, and (3) is easy to use. So, how did we do it? 🤔

This is made possible with @huggingface/jinja, a minimalistic JavaScript implementation of the Jinja templating engine, that we created to align with how transformers handles templating. Although it was originally designed for parsing and rendering ChatML templates, we decided to separate out the templating logic into an external (optional) library due to its usefulness in other types of applications. Special thanks to @tlaceby for his amazing "Guide to Interpreters" series, which provided the basis for our implementation. 🤗
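
To give a sense of how the library can be used on its own, here's a minimal sketch of rendering a template directly with @huggingface/jinja (via its exported Template class):

import { Template } from "@huggingface/jinja";

// Compile a Jinja template and render it with a context object
const template = new Template("Hello, {{ name }}! You have {{ n_messages }} new messages.");
const rendered = template.render({ name: "world", n_messages: 3 });
console.log(rendered); // "Hello, world! You have 3 new messages."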

Anyway, let's take a look at an example:

import { AutoTokenizer } from "@xenova/transformers";

// Load tokenizer from the Hugging Face Hub
const tokenizer = await AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1");

// Define chat messages
const chat = [
  { role: "user", content: "Hello, how are you?" },
  { role: "assistant", content: "I'm doing great. How can I help you today?" },
  { role: "user", content: "I'd like to show off how chat templating works!" },
]

const text = tokenizer.apply_chat_template(chat, { tokenize: false });
// "<s>[INST] Hello, how are you? [/INST]I'm doing great. How can I help you today?</s> [INST] I'd like to show off how chat templating works! [/INST]"

Notice how the entire chat is condensed into a single string. If you would instead like to return the tokenized version (i.e., a list of token IDs), you can use the following:

const input_ids = tokenizer.apply_chat_template(chat, { tokenize: true, return_tensor: false });
// [1, 733, 16289, 28793, 22557, 28725, 910, 460, 368, 28804, 733, 28748, 16289, 28793, 28737, 28742, 28719, 2548, 1598, 28723, 1602, 541, 315, 1316, 368, 3154, 28804, 2, 28705, 733, 16289, 28793, 315, 28742, 28715, 737, 298, 1347, 805, 910, 10706, 5752, 1077, 3791, 28808, 733, 28748, 16289, 28793]

For more information about chat templates, check out the transformers documentation.

🐛 Bug fixes

  • Fixed incorrect encoding/decoding of whitespace around special characters with Fast Llama tokenizers. These bugs will also soon be fixed in the transformers library. For backwards compatibility, a tokenizer exported with the legacy behaviour will continue to act the same way unless explicitly set otherwise; newer exports won't be affected. To override this default, either to keep the legacy behaviour or to upgrade to the fixed version, you can do so as follows:

    import { AutoTokenizer } from '@xenova/transformers';
    
    // Use the default behaviour (specified in tokenizer_config.json, which in this case is `{legacy: false}`).
    const tokenizer = await AutoTokenizer.from_pretrained('Xenova/llama2-tokenizer');
    const { input_ids } = tokenizer('<s>\n', { add_special_tokens: false, return_tensor: false });
    console.log(input_ids); // [1, 13]
    
    // Use the legacy behaviour
    const legacy_tokenizer = await AutoTokenizer.from_pretrained('Xenova/llama2-tokenizer', { legacy: true });
    const { input_ids: legacy_input_ids } = legacy_tokenizer('<s>\n', { add_special_tokens: false, return_tensor: false });
    console.log(legacy_input_ids); // [1, 29871, 13]
  • Strip whitespace around special tokens for wav2vec tokenizers.

🔨 Improvements

  • More comprehensive tokenizer test suite: including both static and dynamic tokenizer tests for encoding, decoding, and chat templates.

Full Changelog: 2.11.0...2.12.0

2.11.0

13 Dec 14:09

What's new?

🤯 8 new architectures!

This release adds support for a bunch of new model architectures, covering a wide range of use cases! In total, we now support 73 different model architectures!

1. ViTMatte for image matting (#448). See here for the list of available models.

Example: Image matting w/ Xenova/vitmatte-small-distinctions-646.

import { AutoProcessor, VitMatteForImageMatting, RawImage } from '@xenova/transformers';

// Load processor and model
const processor = await AutoProcessor.from_pretrained('Xenova/vitmatte-small-distinctions-646');
const model = await VitMatteForImageMatting.from_pretrained('Xenova/vitmatte-small-distinctions-646');

// Load image and trimap
const image = await RawImage.fromURL('https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/vitmatte_image.png');
const trimap = await RawImage.fromURL('https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/vitmatte_trimap.png');

// Prepare image + trimap for the model
const inputs = await processor(image, trimap);

// Predict alpha matte
const { alphas } = await model(inputs);
// Tensor {
//   dims: [ 1, 1, 640, 960 ],
//   type: 'float32',
//   size: 614400,
//   data: Float32Array(614400) [ 0.9894027709960938, 0.9970508813858032, ... ]
// }
Visualization code
import { Tensor, cat } from '@xenova/transformers';

// Visualize predicted alpha matte
const imageTensor = new Tensor(
  'uint8',
  new Uint8Array(image.data),
  [image.height, image.width, image.channels]
).transpose(2, 0, 1);

// Convert float (0-1) alpha matte to uint8 (0-255)
const alphaChannel = alphas
  .squeeze(0)
  .mul_(255)
  .clamp_(0, 255)
  .round_()
  .to('uint8');

// Concatenate original image with predicted alpha
const imageData = cat([imageTensor, alphaChannel], 0);

// Save output image
const outputImage = RawImage.fromTensor(imageData);
outputImage.save('output.png');

Inputs: (Images: the input image and its trimap)

Outputs: (Images: the predicted alpha matte from the quantized and unquantized models)

2. ESM for protein sequence feature-extraction, masked language modelling, token classification, and zero-shot classification (#447). See here for the list of available models.

Example: Protein sequence classification w/ Xenova/esm2_t6_8M_UR50D_sequence_classifier_v1.

import { pipeline } from '@xenova/transformers';

// Create text classification pipeline
const classifier = await pipeline('text-classification', 'Xenova/esm2_t6_8M_UR50D_sequence_classifier_v1');

// Suppose these are your new sequences that you want to classify
// Additional Family 0: Enzymes
const new_sequences_0 = [ 'ACGYLKTPKLADPPVLRGDSSVTKAICKPDPVLEK', 'GVALDECKALDYLPGKPLPMDGKVCQCGSKTPLRP', 'VLPGYTCGELDCKPGKPLPKCGADKTQVATPFLRG', 'TCGALVQYPSCADPPVLRGSDSSVKACKKLDPQDK', 'GALCEECKLCPGADYKPMDGDRLPAAATSKTRPVG', 'PAVDCKKALVYLPKPLPMDGKVCRGSKTPKTRPYG', 'VLGYTCGALDCKPGKPLPKCGADKTQVATPFLRGA', 'CGALVQYPSCADPPVLRGSDSSVKACKKLDPQDKT', 'ALCEECKLCPGADYKPMDGDRLPAAATSKTRPVGK', 'AVDCKKALVYLPKPLPMDGKVCRGSKTPKTRPYGR' ]

// Additional Family 1: Receptor Proteins
const new_sequences_1 = [ 'VGQRFYGGRQKNRHCELSPLPSACRGSVQGALYTD', 'KDQVLTVPTYACRCCPKMDSKGRVPSTLRVKSARS', 'PLAGVACGRGLDYRCPRKMVPGDLQVTPATQRPYG', 'CGVRLGYPGCADVPLRGRSSFAPRACMKKDPRVTR', 'RKGVAYLYECRKLRCRADYKPRGMDGRRLPKASTT', 'RPTGAVNCKQAKVYRGLPLPMMGKVPRVCRSRRPY', 'RLDGGYTCGQALDCKPGRKPPKMGCADLKSTVATP', 'LGTCRKLVRYPQCADPPVMGRSSFRPKACCRQDPV', 'RVGYAMCSPKLCSCRADYKPPMGDGDRLPKAATSK', 'QPKAVNCRKAMVYRPKPLPMDKGVPVCRSKRPRPY' ]

// Additional Family 2: Structural Proteins
const new_sequences_2 = [ 'VGKGFRYGSSQKRYLHCQKSALPPSCRRGKGQGSAT', 'KDPTVMTVGTYSCQCPKQDSRGSVQPTSRVKTSRSK', 'PLVGKACGRSSDYKCPGQMVSGGSKQTPASQRPSYD', 'CGKKLVGYPSSKADVPLQGRSSFSPKACKKDPQMTS', 'RKGVASLYCSSKLSCKAQYSKGMSDGRSPKASSTTS', 'RPKSAASCEQAKSYRSLSLPSMKGKVPSKCSRSKRP', 'RSDVSYTSCSQSKDCKPSKPPKMSGSKDSSTVATPS', 'LSTCSKKVAYPSSKADPPSSGRSSFSMKACKKQDPPV', 'RVGSASSEPKSSCSVQSYSKPSMSGDSSPKASSTSK', 'QPSASNCEKMSSYRPSLPSMSKGVPSSRSKSSPPYQ' ]

// Merge all sequences
const new_sequences = [...new_sequences_0, ...new_sequences_1, ...new_sequences_2];

// Get the predicted class for each sequence
const predictions = await classifier(new_sequences);

// Output the predicted class for each sequence
for (let i = 0; i < predictions.length; ++i) {
    console.log(`Sequence: ${new_sequences[i]}, Predicted class: '${predictions[i].label}'`)
}
// Sequence: ACGYLKTPKLADPPVLRGDSSVTKAICKPDPVLEK, Predicted class: 'Enzymes'
// ... (truncated)
// Sequence: AVDCKKALVYLPKPLPMDGKVCRGSKTPKTRPYGR, Predicted class: 'Enzymes'
// Sequence: VGQRFYGGRQKNRHCELSPLPSACRGSVQGALYTD, Predicted class: 'Receptor Proteins'
// ... (truncated)
// Sequence: QPKAVNCRKAMVYRPKPLPMDKGVPVCRSKRPRPY, Predicted class: 'Receptor Proteins'
// Sequence: VGKGFRYGSSQKRYLHCQKSALPPSCRRGKGQGSAT, Predicted class: 'Structural Proteins'
// ... (truncated)
// Sequence: QPSASNCEKMSSYRPSLPSMSKGVPSSRSKSSPPYQ, Predicted class: 'Structural Proteins'

3. Hubert for audio classification, and automatic speech recognition (#449). See here for the list of available models.

Example: Speech command recognition w/ Xenova/hubert-base-superb-ks.

import { pipeline } from '@xenova/transformers';

// Create audio classification pipeline
const classifier = await pipeline('audio-classification', 'Xenova/hubert-base-superb-ks');

// Classify audio
const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/speech-commands_down.wav';
const output = await classifier(url, { topk: 5 });
// [
//   { label: 'down', score: 0.9954305291175842 },
//   { label: 'go', score: 0.004518700763583183 },
//   { label: '_unknown_', score: 0.00005029444946558215 },
//   { label: 'no', score: 4.877569494965428e-7 },
//   { label: 'stop', score: 5.504634081887616e-9 }
// ]

Example: Perform automatic speech recognition w/ Xenova/hubert-large-ls960-ft.

import { pipeline } from '@xenova/transformers';

// Create automatic speech recognition pipeline
const transcriber = await pipeline('automatic-speech-recognition', 'Xenova/hubert-large-ls960-ft');

// Transcribe audio
const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav';
const output = await transcriber(url);
// { text: 'AND SO MY FELLOW AMERICA ASK NOT WHAT YOUR COUNTRY CAN DO FOR YOU ASK WHAT YOU CAN DO FOR YOUR COUNTRY' }

4. Chinese-CLIP for zero-shot image classification (#455). See here for the list of available models.

Example: Zero-shot image classification w/ Xenova/chinese-clip-vit-base-patch16.

import { pipeline } from '@xenova/transformers';

// Create zero-shot image classification pipeline
const classifier = await pipeline('zero-shot-image-classification', 'Xenova/chinese-clip-vit-base-patch16');

// Set image url and candidate labels
const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/pikachu.png';
const candidate_labels = ['杰尼龟', '妙蛙种子', '小火龙', '皮卡丘'] // Squirtle, Bulbasaur, Charmander, Pikachu in Chinese

// Classify image
const output = await classifier(url, candidate_labels);
console.log(output);
// [
//   { score: 0.9926728010177612, label: '皮卡丘' },        // Pikachu
//   { score: 0.003480620216578245, label: '妙蛙种子' },    // Bulbasaur
//   { score: 0.001942147733643651, label: '杰尼龟' },      // Squirtle
//   { score: 0.0019044597866013646, label: '小火龙' }      // Charmander
// ]

5. DINOv2 for image classification (#444). See here for the list of available models.

Example: Image classification w/ Xenova/dinov2-small-imagenet1k-1-layer.

import { pipeline } from '@xenova/transformers';

// Create image classification pipeline
const classifier = await pipeline('image-classification', 'Xenova/dinov2-small-imagenet1k-1-layer');

// Classify an image
const url = 'http://images.cocodataset.org/val2017/000000039769.jpg';
const output = await classifier(url);
console.log(output);
// [{ label: 'tabby, tabby cat', score: 0.8088238835334778 }]

6. ConvBERT for feature extraction (#445). See [here](https://huggingface.co/models?library=transforme...


2.10.1

06 Dec 17:21

What's new?

🐛 Bug fixes

  • Fix zero-shot-object-detection {percentage: true} in #434. Thanks to @tobiascornille for reporting the issue!
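
    For reference, here's a hedged sketch of the option in question, reusing the OwlViT checkpoint from the 2.9.0 notes below; with percentage: true, bounding boxes are reported relative to the image size rather than in pixels:

    import { pipeline } from '@xenova/transformers';
    
    // Create a zero-shot object detection pipeline
    const detector = await pipeline('zero-shot-object-detection', 'Xenova/owlvit-base-patch32');
    
    // Predict bounding boxes, returned relative to the image dimensions
    const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/astronaut.png';
    const output = await detector(url, ['human face', 'rocket', 'helmet', 'american flag'], {
      percentage: true,
    });
    // Same labels and scores as usual, but each `box` is now expressed relative to the image size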

🛠️ Misc. improvements

  • Documentation improvements and new GitHub issues templates in #299
  • Standardize HF_ACCESS_TOKEN -> HF_TOKEN environment variables in #431

Full Changelog: 2.10.0...2.10.1

2.10.0

05 Dec 14:09

What's new?

🎵 New task: Zero-shot audio classification

The task of classifying audio into classes that are unseen during training. See here for more information.

Example: Perform zero-shot audio classification with Xenova/clap-htsat-unfused.

import { pipeline } from '@xenova/transformers';

// Create a zero-shot audio classification pipeline
const classifier = await pipeline('zero-shot-audio-classification', 'Xenova/clap-htsat-unfused');

const audio = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/dog_barking.wav';
const candidate_labels = ['dog', 'vacuum cleaner'];
const scores = await classifier(audio, candidate_labels);
// [
//   { score: 0.9993992447853088, label: 'dog' },
//   { score: 0.0006007603369653225, label: 'vacuum cleaner' }
// ]

💻 New architectures: CLAP, Audio Spectrogram Transformer, ConvNeXT, and ConvNeXT-v2

We added support for 4 new architectures, bringing the total up to 65!

  1. CLAP for zero-shot audio classification, text embeddings, and audio embeddings (#427). See here for the list of available models.

    • Zero-shot audio classification (same as above)

    • Text embeddings with Xenova/clap-htsat-unfused:

      import { AutoTokenizer, ClapTextModelWithProjection } from '@xenova/transformers';
      
      // Load tokenizer and text model
      const tokenizer = await AutoTokenizer.from_pretrained('Xenova/clap-htsat-unfused');
      const text_model = await ClapTextModelWithProjection.from_pretrained('Xenova/clap-htsat-unfused');
      
      // Run tokenization
      const texts = ['a sound of a cat', 'a sound of a dog'];
      const text_inputs = tokenizer(texts, { padding: true, truncation: true });
      
      // Compute embeddings
      const { text_embeds } = await text_model(text_inputs);
      // Tensor {
      //   dims: [ 2, 512 ],
      //   type: 'float32',
      //   data: Float32Array(1024) [ ... ],
      //   size: 1024
      // }
    • Audio embeddings with Xenova/clap-htsat-unfused:

      import { AutoProcessor, ClapAudioModelWithProjection, read_audio } from '@xenova/transformers';
      
      // Load processor and audio model
      const processor = await AutoProcessor.from_pretrained('Xenova/clap-htsat-unfused');
      const audio_model = await ClapAudioModelWithProjection.from_pretrained('Xenova/clap-htsat-unfused');
      
      // Read audio and run processor
      const audio = await read_audio('https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/cat_meow.wav');
      const audio_inputs = await processor(audio);
      
      // Compute embeddings
      const { audio_embeds } = await audio_model(audio_inputs);
      // Tensor {
      //   dims: [ 1, 512 ],
      //   type: 'float32',
      //   data: Float32Array(512) [ ... ],
      //   size: 512
      // }
  2. Audio Spectrogram Transformer for audio classification (#427). See here for the list of available models.

    import { pipeline } from '@xenova/transformers';
    
    // Create an audio classification pipeline
    const classifier = await pipeline('audio-classification', 'Xenova/ast-finetuned-audioset-10-10-0.4593');
    
    // Predict class
    const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/cat_meow.wav';
    const output = await classifier(url, { topk: 4 });
    // [
    //   { label: 'Meow', score: 0.5617874264717102 },
    //   { label: 'Cat', score: 0.22365376353263855 },
    //   { label: 'Domestic animals, pets', score: 0.1141069084405899 },
    //   { label: 'Animal', score: 0.08985692262649536 },
    // ]
  3. ConvNeXT for image classification (#428). See here for the list of available models.

    import { pipeline } from '@xenova/transformers';
    
    // Create image classification pipeline
    const classifier = await pipeline('image-classification', 'Xenova/convnext-tiny-224');
    
    // Classify an image
    const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/tiger.jpg';
    const output = await classifier(url);
    // [{ label: 'tiger, Panthera tigris', score: 0.6153212785720825 }]
  4. ConvNeXT-v2 for image classification (#428). See here for the list of available models.

    import { pipeline } from '@xenova/transformers';
    
    // Create image classification pipeline
    const classifier = await pipeline('image-classification', 'Xenova/convnextv2-atto-1k-224');
    
    // Classify an image
    const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/tiger.jpg';
    const output = await classifier(url);
    // [{ label: 'tiger, Panthera tigris', score: 0.6391205191612244 }]

🔨 Other improvements

  • Support decoding of tensors in #416
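
    As a rough sketch of what this enables (assuming the change applies to the tokenizer's decode methods; the model below is purely illustrative), the Tensor returned by a tokenizer can be passed straight back for decoding:

    import { AutoTokenizer } from '@xenova/transformers';
    
    // Load a tokenizer (used here purely for illustration)
    const tokenizer = await AutoTokenizer.from_pretrained('Xenova/bert-base-uncased');
    
    // `input_ids` is returned as a Tensor
    const { input_ids } = tokenizer('I love transformers!');
    
    // The Tensor can be decoded directly, without converting it to a plain array first
    const decoded = tokenizer.batch_decode(input_ids, { skip_special_tokens: true });
    console.log(decoded); // e.g. [ 'i love transformers!' ]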

Full Changelog: 2.9.0...2.10.0

2.9.0

21 Nov 14:00

What's new?

😍 Exciting new tasks!

Transformers.js v2.9.0 adds support for three new tasks: (1) Depth estimation, (2) Zero-shot object detection, and (3) Optical document understanding.

🕵️‍♂️ Depth Estimation

The task of predicting the depth of objects present in an image. See here for more information.

import { pipeline } from '@xenova/transformers';

// Create depth estimation pipeline
let depth_estimator = await pipeline('depth-estimation', 'Xenova/dpt-hybrid-midas');

// Predict depth for image
let url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/cats.jpg';
let output = await depth_estimator(url);
(Images: the input photo and the predicted depth map)
Raw output
// {
//   predicted_depth: Tensor {
//     dims: [ 384, 384 ],
//     type: 'float32',
//     data: Float32Array(147456) [ 542.859130859375, 545.2833862304688, 546.1649169921875, ... ],
//     size: 147456
//   },
//   depth: RawImage {
//     data: Uint8Array(307200) [ 86, 86, 86, ... ],
//     width: 640,
//     height: 480,
//     channels: 1
//   }
// }

🎯 Zero-shot Object Detection

The task of identifying objects of classes that are unseen during training. See here for more information.

import { pipeline } from '@xenova/transformers';

// Create zero-shot object detection pipeline
let detector = await pipeline('zero-shot-object-detection', 'Xenova/owlvit-base-patch32');

// Predict bounding boxes
let url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/astronaut.png';
let candidate_labels = ['human face', 'rocket', 'helmet', 'american flag'];
let output = await detector(url, candidate_labels);


Raw output
// [
//   {
//     score: 0.24392342567443848,
//     label: 'human face',
//     box: { xmin: 180, ymin: 67, xmax: 274, ymax: 175 }
//   },
//   {
//     score: 0.15129457414150238,
//     label: 'american flag',
//     box: { xmin: 0, ymin: 4, xmax: 106, ymax: 513 }
//   },
//   {
//     score: 0.13649864494800568,
//     label: 'helmet',
//     box: { xmin: 277, ymin: 337, xmax: 511, ymax: 511 }
//   },
//   {
//     score: 0.10262022167444229,
//     label: 'rocket',
//     box: { xmin: 352, ymin: -1, xmax: 463, ymax: 287 }
//   }
// ]

📝 Optical Document Understanding (image-to-text)

This task involves translating images of scientific PDFs to markdown, enabling easier access to them. See here for more information.

import { pipeline } from '@xenova/transformers';

// Create image-to-text pipeline
let pipe = await pipeline('image-to-text', 'Xenova/nougat-small');

// Generate markdown
let url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/nougat_paper.png';
let output = await pipe(url, {
  min_length: 1,
  max_new_tokens: 40,
  bad_words_ids: [[pipe.tokenizer.unk_token_id]],
});
// [{ generated_text: "# Nougat: Neural Optical Understanding for Academic Documents\n\nLukas Blecher\n\nCorrespondence to: lblecher@meta.com\n\nGuillem Cucur" }]

💻 New architectures: Nougat, DPT, GLPN, OwlViT

We added support for 4 new architectures, bringing the total up to 61!

  • DPT for depth estimation. See here for the list of available models.
  • GLPN for depth estimation. See here for the list of available models.
  • OwlViT for zero-shot object detection. See here for the list of available models.
  • Nougat for optical understanding of academic documents (image-to-text). See here for the list of available models.
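
For example, depth estimation with a GLPN checkpoint works the same way as the DPT example above; a minimal sketch, assuming Xenova/glpn-kitti appears in the linked list of converted models:

import { pipeline } from '@xenova/transformers';

// Create depth estimation pipeline with a GLPN model
let depth_estimator = await pipeline('depth-estimation', 'Xenova/glpn-kitti');

// Predict depth for image
let url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/cats.jpg';
let output = await depth_estimator(url);
// { predicted_depth: Tensor { ... }, depth: RawImage { ... } }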

🔨 Other improvements

  • Add support for Grouped Query Attention on Llama Model by @felladrin in #393
  • Implement max character check by @samlhuillier in #398
  • Add CLIPFeatureExtractor (and tests) in #387
  • Add jsDelivr stats to README in #395
  • Update sharp dependency version in #400

🐛 Bug fixes

  • Move tensor clone to fix Worker ownership NaN issue by @kungfooman in #404
  • Add default token_type_ids for multilingual-e5-* models by @do-me in #403
  • Ensure WASM fallback does not crash in GH actions in #402

🤗 New contributors

Full Changelog: 2.8.0...2.9.0

2.8.0

09 Nov 16:53

What's new?

🖼️ New task: Image-to-image

This release adds support for image-to-image translation (e.g., super-resolution) with Swin2SR models.

(Images: side-by-side full comparison and zoomed animated comparison of the upscaled output)

As always, you can get started in just a few lines of code!

import { pipeline } from '@xenova/transformers';

let url = 'https://huggingface.co/spaces/jjourney1125/swin2sr/resolve/main/testsets/real-inputs/0855.jpg';
let upscaler = await pipeline('image-to-image', 'Xenova/swin2SR-compressed-sr-x4-48');
let output = await upscaler(url);
// RawImage {
//   data: Uint8Array(12582912) [165, 166, 163, ...],
//   width: 2048,
//   height: 2048,
//   channels: 3
// }

💻 New architectures: TrOCR, Swin2SR, Mistral, and Falcon

We also added support for 4 new architectures, bringing the total up to 57! 🤯

  • TrOCR for optical character recognition (OCR).

    import { pipeline } from '@xenova/transformers';
    
    let url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/handwriting.jpg';
    let captioner = await pipeline('image-to-text', 'Xenova/trocr-small-handwritten');
    let output = await captioner(url);
    // [{ generated_text: 'Mr. Brown commented icily.' }]


    Added in #375. See here for the list of available models.

  • Swin2SR for super-resolution and image restoration.

    import { pipeline } from '@xenova/transformers';
    
    let url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/butterfly.jpg';
    let upscaler = await pipeline('image-to-image', 'Xenova/swin2SR-classical-sr-x2-64');
    let output = await upscaler(url);
    // RawImage {
    //   data: Uint8Array(786432) [ 41, 31, 24,  43, ... ],
    //   width: 512,
    //   height: 512,
    //   channels: 3
    // }

    Added in #381. See here for the list of available models.

  • Mistral and Falcon for text-generation. Added in #379.
    Note: Other than testing models, we haven't yet converted any of the larger (≥7B parameter) models. Stay tuned for more updates on this!

🐛 Bug fixes:

  • By default, do not add special tokens at start of text-generation (see commit)
  • Fix Firefox bug when displaying progress events while reading file from browser cache in #374. Thanks to @felladrin for reporting this issue!
  • Fix text2text-generation pipeline output inconsistency w/ python library in #384

🔨 Minor improvements:

  • Upgrade typescript dependency version by @Kit-p in #368
  • Improve docs in #385

🤗 New Contributors

Full Changelog: 2.7.0...2.8.0