feat: Support for adding Special Tokens #470

YashasviChaurasia · 2025-02-14T08:34:42Z

Is your feature request related to a problem? Please describe.

A model's tokenizer may need special tokens that are not already in its vocabulary.
Adding them currently requires code changes: calling the tokenizer.add_special_tokens() method and then resizing the model's token embeddings to match the new vocabulary size.

Alternatively, if the model has reserved special tokens available, we can repurpose them for the new special tokens without resizing the model's token embeddings.

Describe the solution you'd like

Allow the user to provide a list of special tokens to add to the tokenizer's vocabulary through the CLI. The tool can then decide which of the two methods above to use to add the tokens.
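As a rough sketch of what such a CLI argument could look like, the snippet below parses a list of tokens with argparse. The flag name `--add_special_tokens` and the helper `parse_special_tokens` are hypothetical and are not existing fms-hf-tuning options; the real argument handling in the stack may look different.

```python
import argparse


def parse_special_tokens(argv=None):
    """Return the special tokens requested on the command line, without duplicates."""
    parser = argparse.ArgumentParser()
    # Hypothetical flag name, used here for illustration only.
    parser.add_argument(
        "--add_special_tokens",
        nargs="*",
        default=[],
        help="Special tokens to add to the tokenizer's vocabulary",
    )
    # Ignore unrelated arguments so the sketch can sit next to other options.
    args, _unknown = parser.parse_known_args(argv)
    # Drop duplicates while keeping the order the user gave.
    return list(dict.fromkeys(args.add_special_tokens))


if __name__ == "__main__":
    print(parse_special_tokens())
```

Tokens containing shell metacharacters such as `<`, `|`, and `>` need to be quoted on the command line, e.g. `--add_special_tokens "<|assistant|>" "<|sys|>"`.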

Additional context

Working with Chat Templates

Chat templates use special control tokens (e.g. <|start_of_role|>, <|assistant|>, <|sys|>) that are essential for formatting chat messages and telling them apart. These control tokens vary across chat templates.

Issue faced: special tokens are broken down into sub-tokens
If the special tokens used in the chat template are not present in the tokenizer's vocabulary, the tokenizer splits each of them into sub-tokens, which can prevent the model from interpreting the data format correctly.

Example: "<|assistant|>" could be broken up into ["<|", "assistant", "|>"]
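The effect can be illustrated with a toy longest-match tokenizer. This is a simplified stand-in, not how a real BPE tokenizer works internally, and the three-entry vocabulary is made up for the example. It only shows that a string missing from the vocabulary gets split, while the same string stays whole once it is added.

```python
def greedy_tokenize(text, vocab):
    """Toy longest-match tokenizer (illustration only, not real BPE)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Nothing matched: fall back to a single character.
            tokens.append(text[i])
            i += 1
    return tokens


# Made-up vocabulary that lacks the full control token.
vocab = {"<|", "|>", "assistant"}

before = greedy_tokenize("<|assistant|>", vocab)
after = greedy_tokenize("<|assistant|>", vocab | {"<|assistant|>"})

print(before)  # ['<|', 'assistant', '|>']
print(after)   # ['<|assistant|>']
```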

Solution
We can solve the problem by adding the special tokens to the tokenizer's vocabulary, in one of the following ways:

Add the special tokens using the add_special_tokens method and resize the model's token embeddings to match the new vocabulary size.
If the model has Reserved_Special_Tokens (e.g. Llama models), replace those tokens in tokenizer.json and tokenizer_config.json, then load the tokenizer directly from the new files without resizing the model's token embeddings.
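A minimal sketch of the first approach, assuming `tokenizer` and `model` are Hugging Face transformers objects (for example loaded with `AutoTokenizer.from_pretrained` and `AutoModelForCausalLM.from_pretrained`). The helper name `add_tokens_and_resize` is hypothetical. Depending on the transformers version, passing `additional_special_tokens` may replace the tokenizer's existing list of additional special tokens, so that is worth checking before relying on this.

```python
def add_tokens_and_resize(tokenizer, model, new_tokens):
    """Add missing special tokens and grow the embeddings if anything was added.

    Returns the number of tokens actually added to the vocabulary.
    """
    vocab = tokenizer.get_vocab()
    # Only register tokens the tokenizer does not already know.
    missing = [tok for tok in new_tokens if tok not in vocab]
    if not missing:
        return 0

    num_added = tokenizer.add_special_tokens(
        {"additional_special_tokens": missing}
    )
    if num_added > 0:
        # len(tokenizer) counts the base vocabulary plus added tokens.
        model.resize_token_embeddings(len(tokenizer))
    return num_added
```

The new embedding rows start out freshly initialized, so the model only learns useful representations for these tokens during tuning.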

Feature Request

Support adding special tokens through the CLI.
If Reserved_Special_Tokens exist in a model's tokenizer, use them to add the special tokens to the tokenizer's vocabulary. This functionality is not currently supported by the fms-hf-tuning stack.
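The reserved-token route could be sketched roughly as below. It works on the parsed contents of tokenizer.json and tokenizer_config.json and makes several assumptions that should be verified per model: that reserved tokens are named like `<|reserved_special_token_N|>`, that they appear in the `added_tokens` list of tokenizer.json, and that tokenizer_config.json mirrors them under `added_tokens_decoder`. The helper name `replace_reserved_tokens` is hypothetical.

```python
import re

# Assumed naming scheme for reserved tokens (Llama-3-style); verify per model.
RESERVED = re.compile(r"^<\|reserved_special_token_\d+\|>$")


def replace_reserved_tokens(tokenizer_json, tokenizer_config, new_tokens):
    """Rename reserved entries in place and return {new_token: token_id}."""
    added = tokenizer_json.get("added_tokens", [])
    existing = {entry["content"] for entry in added}
    # Skip tokens that are already registered as added tokens.
    wanted = [tok for tok in new_tokens if tok not in existing]
    slots = [entry for entry in added if RESERVED.match(entry["content"])]
    if len(slots) < len(wanted):
        raise ValueError("Not enough reserved tokens; resize embeddings instead.")

    decoder = tokenizer_config.get("added_tokens_decoder", {})
    mapping = {}
    for entry, token in zip(slots, wanted):
        entry["content"] = token
        key = str(entry["id"])  # JSON object keys are strings
        if key in decoder:
            decoder[key]["content"] = token
        mapping[token] = entry["id"]
    return mapping
```

In use, both files would be read with `json.load`, passed through this function, written back with `json.dump` to a new directory, and the tokenizer loaded from that directory. Because the token ids are unchanged, the embedding matrix keeps its size.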


kmehant · 2025-02-14T08:41:50Z

@YashasviChaurasia Looking forward to a draft PR.

kmehant assigned kmehant and YashasviChaurasia and unassigned kmehant Feb 14, 2025

YashasviChaurasia mentioned this issue Feb 18, 2025

feat: Support for add special tokens via cli args #473

Open

2 tasks
