fix: refactor set special tokens function and add unit tests. #475
Description of the change
Opening a draft PR to get everyone's thoughts on the unit tests so far.
Missing Pad Token, LlamaTokenizerFast:
The first test uses a `LlamaTokenizerFast` tokenizer. This tokenizer is only missing a PAD token, but because it is a `LlamaTokenizer`, the function automatically adds the BOS, EOS, UNK, and PAD tokens to the special tokens dict. The `<pad>` token is then replaced with a `<PAD>` token, because the Llama tokenizer does not have a pad token specified.
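For reference, a minimal sketch of what this first test case could look like. The helper name `set_special_tokens_dict`, the import path `tuning.utils.tokenizer_data_utils`, and the `huggyllama/llama-7b` checkpoint are assumptions for illustration, not the exact code in this PR:

```python
# Sketch only -- helper name, import path, and model id below are assumptions.
from transformers import AutoTokenizer

from tuning.utils.tokenizer_data_utils import set_special_tokens_dict  # assumed path


def test_llama_fast_tokenizer_missing_pad_token():
    # LlamaTokenizerFast ships with BOS/EOS/UNK tokens but no PAD token.
    tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b", use_fast=True)

    special_tokens_dict = set_special_tokens_dict(tokenizer)

    # For Llama tokenizers the function adds all four entries, and the missing
    # pad token falls back to the default <PAD>.
    assert special_tokens_dict["pad_token"] == "<PAD>"
    assert {"bos_token", "eos_token", "unk_token", "pad_token"} <= set(special_tokens_dict)
```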
EOS = PAD:
The second test uses a `GPT2TokenizerFast` tokenizer. This tokenizer covers the case where the EOS token equals the PAD token, both being `<|endoftext|>`. So the pad token in the tokenizer is set to `<PAD>`, and `"pad_token": "<PAD>"` is also added to the special tokens dict.
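A similar sketch for the EOS = PAD case, again with the helper name and import path assumed:

```python
# Sketch only -- helper name and import path are assumptions.
from transformers import GPT2TokenizerFast

from tuning.utils.tokenizer_data_utils import set_special_tokens_dict  # assumed path


def test_gpt2_fast_tokenizer_eos_equals_pad():
    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    # Reproduce the EOS = PAD situation: both map to <|endoftext|>.
    tokenizer.pad_token = tokenizer.eos_token

    special_tokens_dict = set_special_tokens_dict(tokenizer)

    # The function separates the two by introducing a dedicated <PAD> token,
    # both on the tokenizer and in the returned dict.
    assert tokenizer.pad_token == "<PAD>"
    assert special_tokens_dict.get("pad_token") == "<PAD>"
```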
Missing Pad Token:
The third test uses a `GPTNeoXTokenizerFast` tokenizer. This tokenizer is another one that is hardcoded into the function to automatically add just a pad token to the special tokens dict. However, the tokenizer itself is also missing a pad token, so the function replaces the `<pad>` token with the default `<PAD>` token.
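A sketch of the GPT-NeoX case, with the same assumed helper and import path:

```python
# Sketch only -- helper name and import path are assumptions.
from transformers import AutoTokenizer

from tuning.utils.tokenizer_data_utils import set_special_tokens_dict  # assumed path


def test_gptneox_fast_tokenizer_missing_pad_token():
    # GPTNeoXTokenizerFast has EOS/BOS/UNK configured but no PAD token.
    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

    special_tokens_dict = set_special_tokens_dict(tokenizer)

    # GPT-NeoX is one of the hardcoded cases: only a pad token is added, and
    # since the tokenizer has none, the default <PAD> is used.
    assert special_tokens_dict == {"pad_token": "<PAD>"}
```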
Missing all tokens:
Added in 781ce58. This test uses the IBM Granite tokenizer and removes all special tokens. The result is that the special tokens dict contains the PAD, EOS, BOS, and UNK tokens.
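And a sketch of the all-tokens-missing case, where the model id is also a placeholder:

```python
# Sketch only -- helper name, import path, and model id are placeholders.
from transformers import AutoTokenizer

from tuning.utils.tokenizer_data_utils import set_special_tokens_dict  # assumed path


def test_granite_tokenizer_missing_all_special_tokens():
    tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3.0-8b-base")

    # Simulate a tokenizer with no special tokens configured at all.
    tokenizer.bos_token = None
    tokenizer.eos_token = None
    tokenizer.unk_token = None
    tokenizer.pad_token = None

    special_tokens_dict = set_special_tokens_dict(tokenizer)

    # All four defaults should be filled in.
    assert set(special_tokens_dict) == {"bos_token", "eos_token", "unk_token", "pad_token"}
```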
Related issue number
Related to Issue #1515
How to verify the PR
You can run:
`tox -e py -- tests/utils/test_tokenizer_data_utils.py`
Was the PR tested