Unexpected Ċ, Ġ and D characters with Llama 3.3 Instruct 70b #142

lemmi · 2024-12-09T10:00:50Z

This has been reported before in #99, but the suggested fix to replace all characters causes an error:

Traceback (most recent call last):
  File "/home/llama/distributed-llama/converter/convert-tokenizer-hf.py", line 83, in <module>
    resolver.resolve()
  File "/home/llama/distributed-llama/converter/convert-tokenizer-hf.py", line 61, in resolve
    return self.resolvePreTrainedTokenizerFast()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/llama/distributed-llama/converter/convert-tokenizer-hf.py", line 26, in resolvePreTrainedTokenizerFast
    assert(tokenizer['model']['vocab'][token] == i)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError

Using a tokenizer from another llama model seems to work, but I'm not sure it actually maps correctly in every case.

One thing I found was this:

Spaces are converted into a special character (the Ġ) in the tokenizer prior to BPE splitting, mostly to avoid digesting spaces, since the standard BPE algorithm used spaces in its process (this can seem a bit hacky, but it was in the original GPT2 tokenizer implementation by OpenAI).
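
For illustration, here is a minimal sketch of the byte-to-character table used by GPT-2 style byte-level BPE, which is where Ġ (space) and Ċ (line feed) come from. It assumes the Llama 3.3 tokenizer follows the same table. The helper name `token_to_bytes` is made up for this example and is not part of the converter.

```python
def bytes_to_unicode():
    """Build the GPT-2 style table mapping each byte (0-255) to a printable character."""
    # Bytes that are already printable keep their own code point.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    # Every remaining byte (space, control characters, ...) is shifted to 256 + n.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def token_to_bytes(token: str) -> bytes:
    """Hypothetical helper: turn a vocab token back into the raw bytes it stands for."""
    # Raises KeyError for characters outside the 256-entry table.
    return bytes(CHAR_TO_BYTE[c] for c in token)


print(BYTE_TO_CHAR[ord(" ")])    # Ġ
print(BYTE_TO_CHAR[ord("\n")])   # Ċ
print(token_to_bytes("Ġhello"))  # b' hello'
```

Because the table is a one-to-one mapping over all 256 byte values, reversing the whole table at once should be safer than replacing a few characters by hand.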

b4rtaz · 2024-12-09T11:18:12Z

Hello @lemmi,

this is a well-known problem. So far, I have solved it by manually replacing characters in the source file (like Ġ => space) before the conversion.
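
A rough sketch of what such a manual replacement could look like in Python. The table and the function name `replace_markers` are hypothetical, and only the two markers whose meaning is certain are covered. How this would be applied to the actual tokenizer file is not shown in this thread, so that part is left out.

```python
# Hypothetical replacement table (an assumption, not taken from the converter).
REPLACEMENTS = {
    "\u0120": " ",   # Ġ -> space
    "\u010a": "\n",  # Ċ -> line feed
}


def replace_markers(text: str) -> str:
    """Replace the known byte-level markers in a string with plain characters."""
    for marker, plain in REPLACEMENTS.items():
        text = text.replace(marker, plain)
    return text


print(repr(replace_markers("HelloĠworldĊ")))  # 'Hello world\n'
```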
