Deferred: calibrate mojibake repair and Cyrillic/homoglyph handling before cleaner changes #99

## Context

The wave-2 tokenizer/corpus review found residual tokens and corpus spans that look like mojibake or script/homoglyph noise, but the same samples also contain legitimate multilingual content. Handling for these should not be folded into the immediate wave-3 cleaner patch until a false-positive test set exists.

## Evidence from wave-2 review

- The F1 and F2 vocabularies both contain tokens such as `Î`, `Ï`, `ï`, `ïõ`, `ïí`, and Cyrillic tokens that look like Latin homoglyphs (`а`, `е`, `о`, `р`, `с`, etc.).
- Full F1 train scan over `310,019` rows / `60.8B` chars found:
  - `mojibake_marker`: `27,861` docs (`8.99%`)
  - `cyrillic_any`: `8,267` docs (`2.67%`)
- Sample contexts include legitimate European/multilingual text and names, e.g. `naïve`, Romanian `Întocmit`, French/Spanish/Polish strings, EU multilingual boilerplate, names with Cyrillic-like letters, and genuine Cyrillic paragraphs.
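
To illustrate why these doc-level counts overstate the problem, here is a minimal sketch of a flagging scan. The regexes are assumptions: the marker set is just the characters named in this issue, and the real `mojibake_marker` / `cyrillic_any` definitions used in the review may differ. `flag_doc` and `scan` are hypothetical names.

```python
import re

# Assumed definitions, not the review's actual rules.
# Marker chars named in this issue: Î (U+00CE), Ï (U+00CF), ï (U+00EF), â (U+00E2).
MOJIBAKE_MARKER = re.compile("[\u00ce\u00cf\u00ef\u00e2]")
CYRILLIC_ANY = re.compile("[\u0400-\u04ff]")  # basic Cyrillic block only


def flag_doc(text):
    """Return per-document flags; a single matching char flags the whole doc."""
    return {
        "mojibake_marker": bool(MOJIBAKE_MARKER.search(text)),
        "cyrillic_any": bool(CYRILLIC_ANY.search(text)),
    }


def scan(docs):
    """Count flagged docs and return (counts, percentages)."""
    counts = {"mojibake_marker": 0, "cyrillic_any": 0}
    total = 0
    for text in docs:
        total += 1
        for name, hit in flag_doc(text).items():
            counts[name] += hit
    rates = {k: (100.0 * v / total if total else 0.0) for k, v in counts.items()}
    return counts, rates


# Legitimate words trip the marker check, which is the false-positive risk.
legit = ["na\u00efve", "\u00centocmit", "plain ascii"]
counts, rates = scan(legit)
```

Under these assumed definitions, both `naïve` and `Întocmit` are flagged as `mojibake_marker` hits, so the doc-level rate is an upper bound on true mojibake rather than an estimate of it.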

## Proposed future work

1. Build a small false-positive/true-positive calibration set before implementing any rewrite.
2. For mojibake, only repair where Latin-1/UTF-8 reinterpretation confidently recovers Greek text; do not blanket-strip marker chars like `Î`, `Ï`, `ï`, or `â`.
3. For Cyrillic/homoglyph handling, prefer word-local majority-script folding only after testing; do not globally strip Cyrillic.
4. Keep this out of the near-term wave-3 cleanup, which should focus on high-confidence run quantization, escaped markdown runs, and bounded glyph residue stripping.
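
Items 2 and 3 could look roughly like the sketch below. It is not a proposed implementation: every function name, the `0.8` Greek-ratio threshold, and the homoglyph table are illustrative assumptions that the calibration set from item 1 would need to validate. It also assumes the mojibake was produced by a plain Latin-1 misread; a Windows-1252 misread would need a different codec.

```python
import re
import unicodedata
from typing import Optional


def _is_greek(ch: str) -> bool:
    # Greek and Coptic block plus Greek Extended.
    return "\u0370" <= ch <= "\u03ff" or "\u1f00" <= ch <= "\u1fff"


def repair_greek_token(token: str, min_greek_ratio: float = 0.8) -> Optional[str]:
    """Return the repaired token, or None unless the round trip is confidently Greek."""
    try:
        fixed = token.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None  # e.g. "naïve" or "Întocmit": not valid UTF-8 once re-encoded
    letters = [ch for ch in fixed if ch.isalpha()]
    if fixed == token or not letters:
        return None
    greek = sum(_is_greek(ch) for ch in letters)
    return fixed if greek / len(letters) >= min_greek_ratio else None


def repair_mojibake(text: str) -> str:
    # Split on ASCII whitespace only: 0x85 and 0xA0 are valid UTF-8 continuation
    # bytes but become whitespace characters when misread as Latin-1.
    parts = re.split(r"([ \t\r\n]+)", text)
    return "".join(repair_greek_token(p) or p for p in parts)


# Illustrative subset of Cyrillic-to-Latin lookalikes (assumption, not exhaustive).
HOMOGLYPHS = {
    "\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p",
    "\u0441": "c", "\u0445": "x", "\u0443": "y",
    "\u0410": "A", "\u0415": "E", "\u041e": "O", "\u0420": "P",
    "\u0421": "C", "\u0425": "X",
}


def _script(ch: str) -> str:
    name = unicodedata.name(ch, "")
    if name.startswith("CYRILLIC"):
        return "cyrillic"
    if name.startswith("LATIN"):
        return "latin"
    return "other"


def fold_word(word: str) -> str:
    """Fold Cyrillic lookalikes to Latin only in words that are majority Latin."""
    scripts = [_script(ch) for ch in word]
    latin, cyr = scripts.count("latin"), scripts.count("cyrillic")
    if cyr == 0 or latin <= cyr:
        return word  # pure or majority-Cyrillic words are left untouched
    if any(s == "cyrillic" and ch not in HOMOGLYPHS for ch, s in zip(word, scripts)):
        return word  # a Cyrillic letter with no lookalike: likely genuine, skip
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in word)


def fold_homoglyphs(text: str) -> str:
    # [^\W\d_]+ matches runs of Unicode letters, i.e. one word at a time.
    return re.sub(r"[^\W\d_]+", lambda m: fold_word(m.group()), text)
```

Both functions are deliberately conservative: anything that fails the round trip or the majority-script test passes through unchanged. This sketch only folds Cyrillic into Latin; the reverse direction (a stray Latin letter inside a Cyrillic word) is left out and would need its own test cases.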

## Local analysis artifacts

Generated during the 2026-04-28 review in the tokenizer-extension workspace:

- `tokenizer_analysis/inspection/wave2_bad_token_analysis_20260428/vocab_bad_token_report.md`
- `tokenizer_analysis/inspection/wave2_bad_token_analysis_20260428/glossapi_only_train_full_mp_v2.md`

These artifacts are local/ignored analysis outputs, not necessarily committed to either repo.
