
Deferred: calibrate mojibake repair and Cyrillic/homoglyph handling before cleaner changes #99

Description

@fffoivos

Context

Wave-2 tokenizer/corpus review found residual tokens and corpus spans that look like mojibake or script/homoglyph noise, but the examples also show legitimate multilingual content. This should not be folded into the immediate wave-3 cleaner patch without a false-positive test set.

Evidence from wave-2 review

  • F1/F2 vocab both contain tokens such as Î, Ï, ï, ïõ, ïí, and Cyrillic homoglyph-looking tokens (а, е, о, р, с, etc.).
  • Full F1 train scan over 310,019 rows / 60.8B chars found:
    • mojibake_marker: 27,861 docs (8.99%)
    • cyrillic_any: 8,267 docs (2.67%)
  • Sample contexts include legitimate European/multilingual text and names, e.g. naïve, Romanian Întocmit, French/Spanish/Polish strings, EU multilingual boilerplate, names with Cyrillic-like letters, and genuine Cyrillic paragraphs.
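To make the false-positive risk concrete, here is a minimal sketch of what document-level flags like these might look like. The marker set and the Cyrillic range are assumptions for illustration; the actual definitions of `mojibake_marker` and `cyrillic_any` used in the wave-2 scan are not shown in this issue.

```python
import re

# Assumed marker set (hypothetical): Î, Ï, ï, â. These are the characters
# named in this issue, not necessarily the scan's real list.
MOJIBAKE_MARKERS = frozenset("\u00ce\u00cf\u00ef\u00e2")

# Assumed definition of "any Cyrillic": the basic Cyrillic block only.
CYRILLIC = re.compile(r"[\u0400-\u04FF]")


def flag_doc(text: str) -> dict[str, bool]:
    """Return coarse per-document flags; a True flag is not proof of damage."""
    return {
        "mojibake_marker": any(ch in MOJIBAKE_MARKERS for ch in text),
        "cyrillic_any": CYRILLIC.search(text) is not None,
    }


# Legitimate text trips the same flags as damaged text:
print(flag_doc("na\u00efve"))        # "naïve" sets mojibake_marker
print(flag_doc("\u00centocmit"))     # Romanian "Întocmit" sets mojibake_marker
```

Under these assumptions, the 8.99% and 2.67% figures are upper bounds on affected documents, which is why a calibration set is needed before any rewrite.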

Proposed future work

  1. Build a small false-positive/true-positive calibration set before implementing any rewrite.
  2. For mojibake, only repair where Latin-1/UTF-8 reinterpretation confidently recovers Greek text; do not blanket-strip marker chars like Î, Ï, ï, or â.
  3. For Cyrillic/homoglyph handling, prefer word-local majority-script folding only after testing; do not globally strip Cyrillic.
  4. Keep this out of the near-term wave-3 cleanup, which should focus on high-confidence run quantization, escaped markdown runs, and bounded glyph residue stripping.
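Items 2 and 3 could be prototyped roughly as below. This is a sketch under stated assumptions, not the planned cleaner: `repair_greek_mojibake`, `fold_homoglyphs`, the `min_pairs` threshold, and the five-entry homoglyph map are all hypothetical, and only the Latin-1 misreading is handled (a cp1252 misreading would need a different byte mapping). The repair relies on the fact that a UTF-8 Greek letter is a 0xCE or 0xCF byte followed by one continuation byte, so only runs of such pairs are touched.

```python
import re
import unicodedata

# Î or Ï followed by one character in U+0080-U+00BF: the Latin-1 view of a
# two-byte UTF-8 sequence with lead byte 0xCE or 0xCF.
_PAIR = "[\u00ce\u00cf][\u0080-\u00bf]"

# Hypothetical minimal map: Cyrillic а, е, о, р, с to Latin lookalikes.
CYR_TO_LAT = {"\u0430": "a", "\u0435": "e", "\u043e": "o",
              "\u0440": "p", "\u0441": "c"}


def repair_greek_mojibake(text: str, min_pairs: int = 2) -> str:
    """Repair only runs of >= min_pairs pairs that decode entirely to letters."""
    run = re.compile(f"(?:{_PAIR}){{{min_pairs},}}")

    def fix(match: re.Match) -> str:
        candidate = match.group(0).encode("latin-1").decode("utf-8")
        if all(unicodedata.category(c).startswith("L") for c in candidate):
            return candidate
        return match.group(0)  # not confident: leave the span untouched

    return run.sub(fix, text)


def _fold_word(word: str) -> str:
    cyr = [c for c in word if "\u0400" <= c <= "\u04ff"]
    lat = [c for c in word if c.isascii() and c.isalpha()]
    # Fold only when Latin is the strict majority and every Cyrillic
    # character in the word has a known lookalike.
    if cyr and len(lat) > len(cyr) and all(c in CYR_TO_LAT for c in cyr):
        return "".join(CYR_TO_LAT.get(c, c) for c in word)
    return word


def fold_homoglyphs(text: str) -> str:
    """Apply word-local majority-script folding; pure Cyrillic is kept."""
    return re.sub(r"\w+", lambda m: _fold_word(m.group(0)), text)


garbled = "\u03ba\u03b1\u03bb\u03b7\u03bc\u03ad\u03c1\u03b1".encode("utf-8").decode("latin-1")
print(repair_greek_mojibake(garbled))           # recovers "καλημέρα"
print(repair_greek_mojibake("na\u00efve"))      # "naïve" is left alone
print(fold_homoglyphs("c\u043empute"))          # Cyrillic о folded: "compute"
```

Both functions decline to act when unsure, so words such as naïve and Întocmit and genuine Cyrillic paragraphs pass through unchanged. The thresholds and the map would still need to be checked against the calibration set from item 1.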

Local analysis artifacts

Generated during the 2026-04-28 review in the tokenizer-extension workspace:

  • tokenizer_analysis/inspection/wave2_bad_token_analysis_20260428/vocab_bad_token_report.md
  • tokenizer_analysis/inspection/wave2_bad_token_analysis_20260428/glossapi_only_train_full_mp_v2.md

These artifacts are local/ignored analysis outputs, not necessarily committed to either repo.
