Commit 2d219b3
authored
vocab : ignore invalid UTF-8 input in the BPE tokenizer (ggml-org#11729)
Silently insert U+FFFD(s) (Unicode replacement character) instead until the
next valid codepoint can be found.
This fixes `llama_tokenize` throwing an exception across the C API boundary
or libllama's module boundary (the caller's runtime might be incompatible!)
Returing a proper error code might be desirable, however the signature
of `llama_tokenize` doesn't allow it as all return values already have
existing meaning.1 parent 333820d commit 2d219b3
1 file changed
+8
-1
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
618 | 618 | | |
619 | 619 | | |
620 | 620 | | |
621 | | - | |
| 621 | + | |
| 622 | + | |
| 623 | + | |
| 624 | + | |
| 625 | + | |
| 626 | + | |
| 627 | + | |
| 628 | + | |
622 | 629 | | |
623 | 630 | | |
624 | 631 | | |
| |||
0 commit comments