feat: Support for adding Special Tokens #470

Open
YashasviChaurasia opened this issue Feb 14, 2025 · 1 comment

@YashasviChaurasia
Contributor

Is your feature request related to a problem? Please describe.

A model's tokenizer may need special tokens that are not already present in its vocabulary.
To add them, we currently have to change code: call the tokenizer.add_special_tokens() method and then resize the model's token embeddings to match the new vocabulary size.

Alternatively, if the model has reserved special tokens available, we can repurpose them for the new special tokens without resizing the model's token embeddings.

Describe the solution you'd like

Allow the user to provide a list of special tokens to add to the tokenizer's vocabulary through the CLI. The stack can then decide which of the two methods above to use to add the tokens.
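
As a rough illustration of what such an option could look like, here is a standalone `argparse` sketch. The flag name `--add_special_tokens` and both helper functions are hypothetical; this is not the existing fms-hf-tuning interface.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Hypothetical tuning CLI sketch")
    # Hypothetical flag name; accepts zero or more space-separated tokens.
    parser.add_argument(
        "--add_special_tokens",
        nargs="*",
        default=[],
        help="Special tokens to add to the tokenizer's vocabulary",
    )
    return parser


def parse_special_tokens(argv):
    args = build_parser().parse_args(argv)
    # Drop duplicates while preserving the order given by the user.
    return list(dict.fromkeys(args.add_special_tokens))


tokens = parse_special_tokens(
    ["--add_special_tokens", "<|start_of_role|>", "<|assistant|>", "<|assistant|>"]
)
print(tokens)
```

On a real command line each token would need to be quoted, since `<`, `|` and `>` are shell metacharacters.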

Additional context

Working with Chat Templates

Chat templates use special control tokens (e.g. <|start_of_role|>, <|assistant|>, <|sys|>) that are essential for formatting chat messages and distinguishing them from one another. These control tokens vary across chat templates.

Issue faced: special tokens are broken down into sub-tokens
If the special tokens used in the chat template are not present in the tokenizer's vocabulary, the tokenizer splits each one into sub-tokens, which can prevent the model from interpreting the data format correctly.

ex: "<|assistant|>" could be broken up into ["<|", "assistant", "|>"]
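
The effect can be reproduced with a toy tokenizer. The sketch below uses greedy longest-match over a tiny made-up vocabulary, which is an assumption for illustration only; real tokenizers (BPE and similar) use different algorithms, but the splitting behaviour for an unknown control token is analogous.

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match tokenization.

    Characters that match no vocabulary entry become single-character tokens.
    """
    tokens = []
    i = 0
    max_len = max(len(piece) for piece in vocab)
    while i < len(text):
        # Try the longest candidate first, shrinking until a vocab entry matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i : i + length]
            if piece in vocab or length == 1:
                tokens.append(piece)
                i += length
                break
    return tokens


# Toy vocabulary (made up for this example).
base_vocab = {"<|", "|>", "assistant", "Hi"}

# Control token missing from the vocabulary: it is split into pieces.
print(greedy_tokenize("<|assistant|>Hi", base_vocab))

# Control token present in the vocabulary: it survives as a single token.
print(greedy_tokenize("<|assistant|>Hi", base_vocab | {"<|assistant|>"}))
```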

Solution
We can solve the problem by adding the special tokens to the tokenizer's vocabulary, in one of the following ways:

  1. Add the special tokens using the add_special_tokens method and resize the model's token embeddings to match the new vocabulary size.
  2. If the model has reserved special tokens (e.g. Llama models), replace those tokens in tokenizer.json and tokenizer_config.json, then load the tokenizer directly from the new files without resizing the model's token embeddings.
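
The second option can be sketched as a pure dictionary rewrite of the two files' contents. This assumes the layout commonly seen in Hugging Face tokenizer files (an `added_tokens` list in tokenizer.json and an `added_tokens_decoder` mapping keyed by token id in tokenizer_config.json) and a `<|reserved_special_token_N|>` naming scheme. The function name, the sample data, and the ids are illustrative and not copied from a real checkpoint.

```python
import copy
import json


def replace_reserved_tokens(tokenizer_json, tokenizer_config, new_tokens,
                            reserved_prefix="<|reserved_special_token_"):
    """Return modified copies of both dicts with reserved tokens renamed."""
    tok = copy.deepcopy(tokenizer_json)
    cfg = copy.deepcopy(tokenizer_config)

    reserved = [
        entry for entry in tok.get("added_tokens", [])
        if entry["content"].startswith(reserved_prefix)
    ]
    if len(new_tokens) > len(reserved):
        raise ValueError(
            f"need {len(new_tokens)} reserved slots, found {len(reserved)}"
        )

    mapping = {}  # old reserved name -> new special token
    decoder = cfg.get("added_tokens_decoder", {})
    for entry, new in zip(reserved, new_tokens):
        mapping[entry["content"]] = new
        entry["content"] = new
        # Keep tokenizer_config.json in sync; its keys are ids as strings.
        if str(entry["id"]) in decoder:
            decoder[str(entry["id"])]["content"] = new

    # Some tokenizers also list special tokens in model.vocab; rename there too.
    vocab = tok.get("model", {}).get("vocab", {})
    for old, new in mapping.items():
        if old in vocab:
            vocab[new] = vocab.pop(old)
    return tok, cfg, mapping


# Illustrative file contents (trimmed, not from a real model).
tokenizer_json = {
    "added_tokens": [
        {"id": 128000, "content": "<|begin_of_text|>", "special": True},
        {"id": 128002, "content": "<|reserved_special_token_0|>", "special": True},
        {"id": 128003, "content": "<|reserved_special_token_1|>", "special": True},
    ],
    "model": {"vocab": {"hello": 0}},
}
tokenizer_config = {
    "added_tokens_decoder": {
        "128000": {"content": "<|begin_of_text|>", "special": True},
        "128002": {"content": "<|reserved_special_token_0|>", "special": True},
        "128003": {"content": "<|reserved_special_token_1|>", "special": True},
    }
}

new_tok, new_cfg, mapping = replace_reserved_tokens(
    tokenizer_json, tokenizer_config, ["<|assistant|>"]
)
print(json.dumps(mapping, indent=2))
```

Because the token ids are unchanged, the embedding matrix keeps its size; the modified dicts would then be written back with `json.dump` and the tokenizer loaded from that directory.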

Feature Request

Support adding special tokens through the CLI.
If reserved special tokens exist in a model's tokenizer, use them to add the special tokens to the tokenizer's vocabulary. This functionality is not supported by the fms-hf-tuning stack.
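
One possible way to decide between the two methods is to plan per token: skip tokens already in the vocabulary, map missing ones onto free reserved slots, and fall back to adding and resizing for the rest. The function name, the return shape, and the `<|reserved_special_token_` prefix are assumptions for this sketch.

```python
def plan_special_tokens(requested, vocab,
                        reserved_prefix="<|reserved_special_token_"):
    """Classify requested tokens by how they would be added.

    `vocab` maps token string -> token id.
    """
    present = [tok for tok in requested if tok in vocab]
    missing = [tok for tok in requested if tok not in vocab]

    # Reserved slots, ordered by token id so the assignment is deterministic.
    free_slots = sorted(
        (tok for tok in vocab if tok.startswith(reserved_prefix)),
        key=vocab.get,
    )

    # zip stops at the shorter list: each free slot takes one missing token.
    reuse = dict(zip(free_slots, missing))
    # Whatever does not fit into a reserved slot needs new embeddings.
    add_and_resize = missing[len(free_slots):]
    return {"present": present, "reuse": reuse, "add_and_resize": add_and_resize}


# Toy vocabulary with a single reserved slot (made up for this example).
vocab = {"hello": 0, "<|assistant|>": 1, "<|reserved_special_token_0|>": 2}
plan = plan_special_tokens(["<|assistant|>", "<|sys|>", "<|start_of_role|>"], vocab)
print(plan)
```

Only the `add_and_resize` bucket would change the vocabulary size, so the embedding resize can be skipped entirely when that list is empty.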

@kmehant kmehant assigned kmehant and YashasviChaurasia and unassigned kmehant Feb 14, 2025
@kmehant
Collaborator

kmehant commented Feb 14, 2025

@YashasviChaurasia Looking forward to a draft PR.
