Is your feature request related to a problem? Please describe.
A model's tokenizer might require some special tokens which are not already present in the tokenizer's vocabulary.
Adding these tokens to the vocabulary currently requires changes in code, using the tokenizer.add_special_tokens() method and resizing the model's token embeddings to the new vocabulary size.
Alternatively, if the model has reserved special tokens available, we can repurpose them to add these special tokens without resizing the model's token embeddings.
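For reference, a minimal sketch of the code a user has to write today for the first approach, using the standard Hugging Face transformers API (the model name and token strings below are placeholders, not values assumed by fms-hf-tuning):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your/base-model"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Register the chat-template control tokens as additional special tokens.
new_tokens = ["<|start_of_role|>", "<|end_of_role|>"]
num_added = tokenizer.add_special_tokens({"additional_special_tokens": new_tokens})

# The embedding matrix must grow to cover the enlarged vocabulary.
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))
```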
Describe the solution you'd like
Allow the user to provide a list of special tokens to be added to the tokenizer's vocabulary through the CLI; the library can then decide which of the above methods to use to add the tokens.
Additional context
Working with Chat Templates
Chat templates use special control tokens (e.g. <|start_of_role|>, <|assistant|>, <|sys|>, etc.) which are essential for formatting chat messages and differentiating them from one another. These control tokens can vary across chat templates.
Issue Faced: Special tokens are broken down into sub-tokens
If the special tokens used in the chat template are not present in the tokenizer's vocabulary, the tokenizer will split each special token into sub-tokens, which can prevent the model from interpreting the data format correctly.
ex: "<|assistant|>" could be broken up into [ "<|" , "assistant" , "|>" ]
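A quick way to check whether a token is affected, assuming the Hugging Face tokenizer API (the model name is again a placeholder):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your/base-model")  # placeholder checkpoint

# If "<|assistant|>" is not registered as a special token, it gets split into pieces.
print(tokenizer.tokenize("<|assistant|>"))
# e.g. ['<|', 'assistant', '|>'] instead of a single '<|assistant|>' token
```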
Solution
We can solve the problem by adding the special tokens to the tokenizer's vocabulary, in one of the following ways:
Add the special tokens using the tokenizer.add_special_tokens() method and resize the model's token embeddings to match the new vocabulary size.
If the model has reserved special tokens (e.g. Llama models), replace those tokens in tokenizer.json and tokenizer_config.json, then load the tokenizer directly from the updated files without resizing the model's token embeddings (see the sketch after this list).
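A rough sketch of the second approach. It assumes Llama-3-style reserved tokens named <|reserved_special_token_N|>; the actual token names, file layout, and number of available reserved slots must be checked against the specific checkpoint:

```python
import json
from pathlib import Path

from transformers import AutoTokenizer

tokenizer_dir = Path("path/to/local/tokenizer")  # local copy of the model's tokenizer files
new_tokens = ["<|start_of_role|>", "<|end_of_role|>"]

for file_name in ("tokenizer.json", "tokenizer_config.json"):
    path = tokenizer_dir / file_name
    text = path.read_text()
    # Rename reserved placeholder tokens to the chat-template control tokens.
    # The token IDs stay the same, so the embedding matrix does not need resizing.
    for i, token in enumerate(new_tokens):
        text = text.replace(f"<|reserved_special_token_{i}|>", token)
    path.write_text(text)

# Reload the tokenizer from the patched files.
tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir)
```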
Feature Request
Support adding special tokens through the CLI.
If reserved special tokens exist in a model's tokenizer, use them to add the special tokens to the tokenizer's vocabulary. This functionality is not currently supported by the fms-hf-tuning stack.