Context
Wave-2 tokenizer/corpus review found residual tokens and corpus spans that look like mojibake or script/homoglyph noise, but the examples also show legitimate multilingual content. This should not be folded into the immediate wave-3 cleaner patch without a false-positive test set.
Evidence from wave-2 review
- F1/F2 vocab both contain tokens such as Î, Ï, ï, ïõ, ïí, and Cyrillic homoglyph-looking tokens (а, е, о, р, с, etc.).
- Full F1 train scan over 310,019 rows / 60.8B chars found:
  - mojibake_marker: 27,861 docs (8.99%)
  - cyrillic_any: 8,267 docs (2.67%)
- Sample contexts include legitimate European/multilingual text and names, e.g. naïve, Romanian Întocmit, French/Spanish/Polish strings, EU multilingual boilerplate, names with Cyrillic-like letters, and genuine Cyrillic paragraphs.
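The review does not say how the scan defined its two counters, so the following is only a rough sketch of what such a document-level count could look like. The marker set, the Cyrillic range, and the function name `scan_docs` are assumptions made for illustration. It also shows why the raw percentages overstate the problem: legitimate words like naïve and Întocmit trip the marker check.

```python
import re

# Assumed marker set, taken from the tokens listed above (Î, Ï, ï, â).
# The real scan may have used a different definition.
MOJIBAKE_MARKERS = re.compile("[\u00ce\u00cf\u00ef\u00e2]")
# Basic Cyrillic block only; an assumption, not the scan's actual rule.
CYRILLIC_ANY = re.compile("[\u0400-\u04ff]")


def scan_docs(docs):
    """Return {counter_name: (doc_count, percent_of_docs)} for an iterable of strings."""
    counts = {"mojibake_marker": 0, "cyrillic_any": 0}
    total = 0
    for text in docs:
        total += 1
        if MOJIBAKE_MARKERS.search(text):
            counts["mojibake_marker"] += 1
        if CYRILLIC_ANY.search(text):
            counts["cyrillic_any"] += 1
    return {
        name: (n, 100.0 * n / total if total else 0.0)
        for name, n in counts.items()
    }


if __name__ == "__main__":
    sample = [
        "na\u00efve",                            # legitimate French/English word
        "\u00centocmit",                         # legitimate Romanian word
        "plain ascii text",
        "\u041f\u0440\u0438\u0432\u0435\u0442",  # genuine Cyrillic
    ]
    for name, (n, pct) in scan_docs(sample).items():
        print(f"{name}: {n} docs ({pct:.2f}%)")
```

A counter like this measures character presence, not damage, so its hits are candidates for the calibration set rather than confirmed noise.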
Proposed future work
- Build a small false-positive/true-positive calibration set before implementing any rewrite.
- For mojibake, only repair where Latin-1/UTF-8 reinterpretation confidently recovers Greek text; do not blanket-strip marker chars like Î, Ï, ï, or â.
- For Cyrillic/homoglyph handling, prefer word-local majority-script folding only after testing; do not globally strip Cyrillic.
- Keep this out of the near-term wave-3 cleanup, which should focus on high-confidence run quantization, escaped markdown runs, and bounded glyph residue stripping.
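As a starting point for the calibration work, the two conservative rules above could be prototyped roughly as follows. This is a sketch under stated assumptions, not the planned cleaner: the function names, the 0.8 Greek-ratio threshold, and the five-letter lookalike table are all hypothetical and would need tuning against the false-positive/true-positive set.

```python
# Hypothetical lookalike table: Cyrillic а, е, о, р, с mapped to Latin a, e, o, p, c.
CYR_TO_LAT = {
    "\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p", "\u0441": "c",
}


def is_greek(ch):
    # Greek and Coptic block plus Greek Extended.
    return "\u0370" <= ch <= "\u03ff" or "\u1f00" <= ch <= "\u1fff"


def is_cyrillic(ch):
    return "\u0400" <= ch <= "\u04ff"


def repair_greek_mojibake(span, min_greek_ratio=0.8):
    """Undo UTF-8-read-as-Latin-1 damage only if the result is mostly Greek.

    Returns the span unchanged whenever the round trip fails or the
    candidate is not confidently Greek (threshold is an assumption).
    """
    try:
        candidate = span.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return span  # not a clean Latin-1/UTF-8 reinterpretation
    letters = [ch for ch in candidate if ch.isalpha()]
    if not letters:
        return span
    greek = sum(1 for ch in letters if is_greek(ch))
    return candidate if greek / len(letters) >= min_greek_ratio else span


def fold_homoglyphs(word):
    """Fold known Cyrillic lookalikes to Latin only in Latin-majority words."""
    latin = sum(1 for ch in word if ch.isascii() and ch.isalpha())
    cyrillic = sum(1 for ch in word if is_cyrillic(ch))
    if cyrillic == 0 or latin <= cyrillic:
        return word  # genuine Cyrillic or ambiguous: leave untouched
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in word)


if __name__ == "__main__":
    greek = "\u03ba\u03b1\u03bb\u03b7\u03bc\u03ad\u03c1\u03b1"
    damaged = greek.encode("utf-8").decode("latin-1")
    print(repair_greek_mojibake(damaged) == greek)
    print(repair_greek_mojibake("na\u00efve"))
    print(fold_homoglyphs("p\u0430ssword"))
```

Both functions default to returning their input unchanged, so words such as naïve and Întocmit and fully Cyrillic words pass through untouched. Applying the repair per word or short span, rather than per document, keeps one non-Latin-1 character from blocking repairs elsewhere in the text.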
Local analysis artifacts
Generated during the 2026-04-28 review in the tokenizer-extension workspace:
tokenizer_analysis/inspection/wave2_bad_token_analysis_20260428/vocab_bad_token_report.md
tokenizer_analysis/inspection/wave2_bad_token_analysis_20260428/glossapi_only_train_full_mp_v2.md
These artifacts are local/ignored analysis outputs, not necessarily committed to either repo.