Open
Labels: Epic (high-level theme containing smaller stories), enhancement (new feature or request)
Description
Goal: Enable .NET developers to use tokenizers in their data pre-processing pipelines as part of their embedding and token generation tasks using language models.
Committed:
- Add support for more commonly used Tokenizers
  - TikToken: Introducing Tiktoken Tokenizer (#6981)
  - LlamaTokenizer & SentencePiece algorithm: [Tokenizers] Port LLaMA Tokenizer and SentencePiece algorithm (#6987)
  - CodeGenTokenizer & Byte-level BPE: [Tokenizers] Port CodeGenTokenizer & byte-level BPE algorithm (#6992)
  - WordPiece algorithm: [Tokenizers] Implement WordPiece algorithm (#6988)
  - BERTTokenizer: [Tokenizers] Port BERTTokenizers (#6991)
- Measure and improve performance of the Tokenizers API, making breaking changes where necessary. (Track Tokenizers design feedback #6982)
- Explore existing construction patterns to improve usability - both in factory API and load from configuration.
- Drive adoption of Microsoft.ML.Tokenizers in other libraries
- Docs and samples
Backlog:
- Investigate using Microsoft.ML.Tokenizers in Azure OpenAI SDK
- SentencePiece Unigram: Implement Sentencepiece Unigram tokenizer (#7186)
- CLIP Tokenizer: [Tokenizers] Port CLIP Tokenizer (#6993)