Spaces are converted to a special character (Ġ) in the tokenizer prior to BPE splitting, mostly to avoid digesting raw spaces, since the standard BPE algorithm uses spaces as separators in its process (this can seem a bit hacky, but it was in the original GPT-2 tokenizer implementation by OpenAI).
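For context, here is a sketch of the byte-to-character table along the lines of the `bytes_to_unicode` helper in OpenAI's GPT-2 `encoder.py`: printable bytes map to themselves, and the rest (including the space byte, 0x20) are shifted into the 256+ codepoint range so every byte gets a visible stand-in character:

```python
def bytes_to_unicode():
    # Bytes that already have a clean printable character map to themselves.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\xa1"), ord("\xac") + 1))
          + list(range(ord("\xae"), ord("\xff") + 1)))
    cs = bs[:]
    n = 0
    # Every remaining byte (control chars, space, etc.) is assigned a
    # codepoint at 256 + n, giving it a visible surrogate character.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

mapping = bytes_to_unicode()
print(mapping[ord(" ")])  # -> 'Ġ' (U+0120): the space byte's stand-in
```

The space byte is the 33rd non-printable byte (0x00 through 0x20), so it lands at codepoint 256 + 32 = 0x120, which is Ġ.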
This has been reported before in #99, but the suggested fix to replace all characters causes an error:
Using a tokenizer from another Llama model seems to work, but I'm not sure it actually maps correctly in every case.
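One way to spot-check whether a borrowed tokenizer "maps correctly" would be to compare its output against the original on a set of sample strings. This is only a hypothetical helper; the lambdas below are stand-ins for the two real tokenizers' encode functions:

```python
def tokenizations_agree(encode_a, encode_b, samples):
    """Return the samples on which the two encoders produce different ids."""
    return [s for s in samples if encode_a(s) != encode_b(s)]

# Stand-ins for two real encode functions (hypothetical; swap in the
# actual tokenizers you are comparing).
enc_a = lambda s: [ord(c) for c in s]
enc_b = lambda s: [ord(c) for c in s.lower()]

# Only samples with uppercase letters diverge between these two stand-ins.
print(tokenizations_agree(enc_a, enc_b, ["abc", "ABC"]))  # -> ['ABC']
```

A more thorough check would also decode the ids back and assert the round trip reproduces the input, since id-level agreement on a small sample doesn't guarantee agreement everywhere.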
One thing I found was this: