
fix: refactor set special tokens function and add unit tests. #475

Draft · wants to merge 4 commits into main from fix-refactor-special-tokens
Conversation

@Luka-D (Contributor) commented Feb 18, 2025

Description of the change

Opening a draft PR to gather everyone's thoughts on the unit tests so far.

Missing Pad Token, LlamaTokenizerFast:

The first test uses a LlamaTokenizerFast tokenizer. This tokenizer is only missing a PAD token; however, because it is a LlamaTokenizer, the function automatically adds the BOS, EOS, UNK and PAD tokens to the special tokens dict. The <pad> entry is then replaced with <PAD>, because the Llama tokenizer has no pad token specified.
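A minimal sketch of how this test might look, assuming set_special_tokens_dict takes the tokenizer and returns the dict of tokens to add (the actual signature lives in tuning.utils.tokenizer_data_utils), and using the small public hf-internal-testing/llama-tokenizer checkpoint as a stand-in:

```python
from transformers import LlamaTokenizerFast

from tuning.utils.tokenizer_data_utils import set_special_tokens_dict


def test_setting_special_tokens_with_llama_tokenizer_fast():
    # Small public Llama tokenizer: has BOS/EOS/UNK but no PAD token.
    tokenizer = LlamaTokenizerFast.from_pretrained(
        "hf-internal-testing/llama-tokenizer"
    )
    special_tokens_dict = set_special_tokens_dict(tokenizer)
    # All four tokens are added for Llama tokenizers, and the missing
    # pad token is replaced by the default <PAD>.
    assert special_tokens_dict["pad_token"] == "<PAD>"
    assert {"bos_token", "eos_token", "unk_token"} <= special_tokens_dict.keys()
```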

EOS = PAD:

The second test uses a GPT2TokenizerFast tokenizer. This covers the case where the EOS token equals the PAD token; both are <|endoftext|>. The pad token on the tokenizer is set to <PAD>, and "pad_token": "<PAD>" is also added to the special tokens dict.
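A sketch of this second case, under the same assumptions about set_special_tokens_dict; the pad token is pinned to the EOS token explicitly so the collision is guaranteed regardless of checkpoint defaults:

```python
from transformers import GPT2TokenizerFast

from tuning.utils.tokenizer_data_utils import set_special_tokens_dict


def test_setting_special_tokens_when_eos_equals_pad():
    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    # Force EOS == PAD == <|endoftext|> to exercise the collision branch.
    tokenizer.pad_token = tokenizer.eos_token
    special_tokens_dict = set_special_tokens_dict(tokenizer)
    # PAD is remapped to the default <PAD> so it differs from EOS.
    assert special_tokens_dict["pad_token"] == "<PAD>"
    assert special_tokens_dict["pad_token"] != tokenizer.eos_token
```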

Missing Pad Token:

The third test uses a GPTNeoXTokenizerFast tokenizer. This is another class the function hardcodes to add just a pad token to the special tokens dict. However, the tokenizer itself is also missing a pad token, so the function replaces the hardcoded <pad> with the default <PAD>.
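And a sketch of the GPT-NeoX case, again assuming the same helper shape; EleutherAI/gpt-neox-20b is used here as a representative GPTNeoXTokenizerFast checkpoint:

```python
from transformers import GPTNeoXTokenizerFast

from tuning.utils.tokenizer_data_utils import set_special_tokens_dict


def test_setting_special_tokens_with_gpt_neox_tokenizer_fast():
    tokenizer = GPTNeoXTokenizerFast.from_pretrained("EleutherAI/gpt-neox-20b")
    special_tokens_dict = set_special_tokens_dict(tokenizer)
    # Only a pad token is hardcoded for GPT-NeoX, and since the tokenizer
    # has no pad token of its own, the default <PAD> replaces <pad>.
    assert special_tokens_dict["pad_token"] == "<PAD>"
    assert "eos_token" not in special_tokens_dict
```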

Missing all tokens:

Added in 781ce58. This test uses the IBM Granite tokenizer with all of its special tokens removed. The result is that the special tokens dict contains the PAD, EOS, BOS and UNK tokens.
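A sketch of the all-tokens-missing test; the checkpoint name is a placeholder for whichever Granite tokenizer the test actually loads, and clearing the tokens via assignment is one way to simulate a tokenizer with none set:

```python
from transformers import AutoTokenizer

from tuning.utils.tokenizer_data_utils import set_special_tokens_dict


def test_setting_special_tokens_when_all_are_missing():
    # Placeholder model id; the real test uses an IBM Granite tokenizer.
    tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3.0-2b-base")
    # Remove every special token so the helper must supply all defaults.
    tokenizer.pad_token = None
    tokenizer.eos_token = None
    tokenizer.bos_token = None
    tokenizer.unk_token = None
    special_tokens_dict = set_special_tokens_dict(tokenizer)
    assert {"pad_token", "eos_token", "bos_token", "unk_token"} <= (
        special_tokens_dict.keys()
    )
```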

Related issue number

Related to Issue #1515

How to verify the PR

You can run:
tox -e py -- tests/utils/test_tokenizer_data_utils.py

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass


Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

@Luka-D Luka-D changed the title Fix: refactor set special tokens function and add unit tests. fix: refactor set special tokens function and add unit tests. Feb 18, 2025
@github-actions github-actions bot added the fix label Feb 18, 2025
Refactored the set special tokens code into its own function in the tuning.utils.tokenizer_data_utils file. Imported the new function into sft_trainer and called it to set special tokens.

Signed-off-by: Luka Dojcinovic <[email protected]>
Added unit tests for set_special_tokens_dict() for LlamaTokenizerFast, GPT2TokenizerFast and GPTNeoXTokenizerFast.

Signed-off-by: Luka Dojcinovic <[email protected]>
@Luka-D Luka-D force-pushed the fix-refactor-special-tokens branch from 42a9c32 to 9cb8af8 on February 18, 2025 at 18:34
Minor fix to remove print statements

Signed-off-by: Luka Dojcinovic <[email protected]>
@Luka-D Luka-D marked this pull request as draft February 18, 2025 18:39
Also changed the conditional for matching eos and pad tokens so it does not fire when both are None.

Signed-off-by: Luka Dojcinovic <[email protected]>
@Luka-D (Contributor, Author) commented Feb 18, 2025

With commit 781ce58, I have added a new unit test for when a tokenizer has no special tokens. I figured it was easier to test a tokenizer missing all the tokens than to write four separate unit tests for missing EOS, BOS, PAD, etc. I can split it into four unit tests if you feel that is a better idea.

I also changed the conditional for the case of EOS = PAD token. Because both tokens are set to None, the conditional tokenizer.pad_token == tokenizer.eos_token would evaluate to True. I am unsure if we want the warning to appear when both tokens are None, since I wouldn't really say they are identical, and the defaults added in the code above make them different anyway (PAD is set to <PAD> and EOS is set to </s>).
I have added if (tokenizer.pad_token is not None and tokenizer.pad_token == tokenizer.eos_token) to fix this.
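For reference, a minimal sketch of the change (the body of the branch is illustrative, not the exact code):

```python
# Old check: also true when both tokens are None, since None == None.
# if tokenizer.pad_token == tokenizer.eos_token:

# New check: only fires on a genuine PAD/EOS collision.
if tokenizer.pad_token is not None and tokenizer.pad_token == tokenizer.eos_token:
    # Illustrative: the real code swaps PAD to the default and warns.
    special_tokens_dict["pad_token"] = "<PAD>"
```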

Please let me know your thoughts on this. Thank you
