corpus.clean / clean_ocr: audit LaTeX handling for math-enhancer outputs and unify cleanup ideas across the two cleaning paths (#98)

## Background

There are currently two Markdown-cleaning entry points, each with its own Rust backend and largely independent heuristics:

1. **`corpus.clean`** — `Corpus.clean()` in `src/glossapi/corpus/phase_clean.py`, backed by `rust/glossapi_rs_cleaner/` (the main cleaner). Owns LaTeX-aware repetition cropping in `rust/glossapi_rs_cleaner/src/latex_module.rs` (default `enable_latex_repetition_crop=true` after commit `8f0ce02`).
2. **`corpus.clean_ocr`** — `Corpus.clean_ocr()` at `phase_clean.py:3801`, backed by `rust/glossapi_rs_noise/`. Has its own page-loop + repetition / progression / digit / word-window detectors. Multiple debug variants (numeric, word, hybrid, latex_slot_progression, latex).

Meanwhile, `corpus.extract --math-enhance` (`src/glossapi/corpus/phase_ocr_math.py`) emits HTML-comment placeholders in the extracted MD; a math model OCRs the corresponding equations from the PDF and the comments are later replaced with real LaTeX (typically `$$ ... $$` blocks or inline `$ ... $`).

## Two concerns

### (1) Make sure `corpus.clean` handles math-enhancer outputs cleanly

The math-enhancer produces real LaTeX content. `corpus.clean`'s `crop_latex_repetitions` is designed for repetitive *garbage* OCR-LaTeX (degenerate `\frac` chains, repeated chars, cyclic patterns). It must NOT spuriously crop legitimate model-OCR'd math.

Action items:

- **Audit current behaviour** of `latex_module.rs` cropping heuristics on a sample of math-enhancer outputs. Heuristics in scope:
  - `detect_repeated_char_cut`
  - `detect_repeated_lines_cut`
  - `detect_repeated_element_cut`
  - `detect_small_vocab_run_cut`
  - `detect_cyclic_element_cut`
  - `detect_unbalanced_braces_in_latex_span`
  - `detect_degenerate_frac_cut`
  - `detect_monotonic_element_cut`
- Build a small fixture set of "real" model-OCR LaTeX (long but valid equations: integrals, summations, matrix expansions, multi-line display-style math) and confirm `enable_latex_repetition_crop=true` does not damage them.
- If false positives surface, add a provenance-based exception: "if the LaTeX span came from a math-enhancer placeholder substitution, skip aggressive cropping". This needs a marker: either preserve a stable comment ID on the replaced span, or track post-substitution spans during the replacement step.
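The "track post-substitution spans" option could look roughly like the sketch below. It is only an illustration: the `<!-- math:ID -->` placeholder syntax, `MathSpan`, `substitute_with_provenance` and `is_math_enhanced` are all hypothetical names, not the actual format or API of `phase_ocr_math.py`.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# ASSUMPTION: hypothetical placeholder syntax. The real comment format
# emitted by `corpus.extract --math-enhance` may differ.
PLACEHOLDER_RE = re.compile(r"<!--\s*math:(?P<id>[\w-]+)\s*-->")


@dataclass(frozen=True)
class MathSpan:
    placeholder_id: str
    start: int  # offset into the substituted text (inclusive)
    end: int    # offset into the substituted text (exclusive)


def substitute_with_provenance(
    md: str, latex_by_id: dict[str, str]
) -> tuple[str, list[MathSpan]]:
    """Replace placeholders with LaTeX and record where each one landed."""
    out: list[str] = []
    spans: list[MathSpan] = []
    cursor = 0  # read position in the input text
    length = 0  # length of the output built so far
    for m in PLACEHOLDER_RE.finditer(md):
        pid = m.group("id")
        if pid not in latex_by_id:
            continue  # unknown placeholder: leave it in the text untouched
        prefix = md[cursor:m.start()]
        latex = latex_by_id[pid]
        out.append(prefix)
        length += len(prefix)
        spans.append(MathSpan(pid, length, length + len(latex)))
        out.append(latex)
        length += len(latex)
        cursor = m.end()
    out.append(md[cursor:])
    return "".join(out), spans


def is_math_enhanced(spans: list[MathSpan], start: int, end: int) -> bool:
    """True if [start, end) lies entirely inside a recorded substitution."""
    return any(s.start <= start and end <= s.end for s in spans)


text, spans = substitute_with_provenance(
    "Intro <!-- math:eq1 --> tail <!-- math:eq2 -->",
    {"eq1": "$$ a+b $$"},
)
```

A cropping pass could consult `is_math_enhanced` before cutting a span. Note that these offsets count Python code points; if the spans are handed to the Rust side, they would likely need converting to byte offsets.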

Currently handled (per `latex_module.rs` doc-comment): multi-line `$$ ... $$` and single-line `$$ ... $$`. Deferred: inline `$ ... $`, `\begin{env} ... \end{env}` environments. The math enhancer may emit any of these forms, so coverage needs to be confirmed.
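To find out which of these forms actually occur in math-enhancer output, a rough tally over a sample of enhanced MD may be enough. The sketch below is an assumption-laden helper (`tally_latex_forms` and the regexes are illustrative, not part of the codebase): it counts `$$` spans first, then bare environments outside them, then single-line inline spans, and it does not handle escaped `\$` inside inline math.

```python
import re
from collections import Counter

# Display math: $$ ... $$, possibly spanning several lines.
DISPLAY_RE = re.compile(r"(?<!\\)\$\$(.+?)(?<!\\)\$\$", re.DOTALL)
# Bare environments: \begin{name} ... \end{name} with a matching name.
ENV_RE = re.compile(r"\\begin\{([A-Za-z*]+)\}.*?\\end\{\1\}", re.DOTALL)
# Inline math: $ ... $ on a single line, not part of a $$ delimiter.
INLINE_RE = re.compile(r"(?<![\\$])\$(?!\$)[^$\n]+\$(?!\$)")


def tally_latex_forms(md: str) -> Counter:
    """Count LaTeX span forms; each span is counted under one form only."""
    counts = Counter(
        display_single_line=0, display_multi_line=0, environment=0, inline=0
    )

    def take_display(m: re.Match) -> str:
        multi = "\n" in m.group(0)
        counts["display_multi_line" if multi else "display_single_line"] += 1
        return " " * len(m.group(0))  # blank out so later passes skip it

    def take_env(m: re.Match) -> str:
        counts["environment"] += 1
        return " " * len(m.group(0))

    rest = DISPLAY_RE.sub(take_display, md)
    rest = ENV_RE.sub(take_env, rest)
    counts["inline"] = len(INLINE_RE.findall(rest))
    return counts


sample = (
    "Inline $x^2$ here.\n"
    "$$ E = mc^2 $$\n"
    "$$\n\\int_0^1 f(x)\\,dx\n$$\n"
    "\\begin{align}\na &= b\n\\end{align}\n"
)
form_counts = tally_latex_forms(sample)
```

Non-zero counts for `inline` or `environment` on real enhancer output would indicate that the deferred forms need handling before the audit can be considered complete.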

### (2) Bring `clean_ocr` up to date with `corpus.clean`'s LaTeX ideas

`clean_ocr` (Rust backend `glossapi_rs_noise`) and `clean` (Rust backend `glossapi_rs_cleaner`) developed in parallel and don't share their LaTeX-handling code. Concretely missing on the OCR side:

- The 8-detector cropping heuristic family from `latex_module.rs` (degenerate fracs, cyclic elements, monotonic runs, etc.).
- Phase A formatting (paragraph reflow / GFM table delimiter min / HR canonicalization) — relevant after Pilot B integration via `PhaseAMode` (see #97 and the parser-backed Phase A docs).
- The `non_destructive_canonicalize` shared baseline used by `corpus.clean`'s verifier.

Action items:

- Decide the architecture. Option A (split): keep `clean_ocr` as a separate path (different invariants for OCR'd vs Docling-extracted MD) but factor the COMMON LaTeX-cropping detectors into a shared crate / module that both `glossapi_rs_cleaner` and `glossapi_rs_noise` consume.
- Option B (merge): have `clean_ocr` call into `glossapi_rs_cleaner` with an OCR-profile config rather than maintain two binaries.
- Either way: ensure that when a corpus row passes through `clean_ocr` (e.g., math-enhanced rows), the LaTeX cropping detectors apply consistently with what `corpus.clean` would do.
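Whichever option is chosen, the consistency requirement can be checked with a small harness that runs both paths over the same fixtures. The sketch below assumes each path can be exposed as a plain `str -> str` callable; `compare_croppers`, `identity_crop` and `toy_run_crop` are hypothetical stand-ins, not the real Rust-backed entry points.

```python
import re
from typing import Callable

Cropper = Callable[[str], str]


def compare_croppers(
    fixtures: dict[str, str], clean_crop: Cropper, clean_ocr_crop: Cropper
) -> list[dict]:
    """Run both croppers on every fixture and report changes and agreement."""
    report = []
    for name, src in fixtures.items():
        a = clean_crop(src)
        b = clean_ocr_crop(src)
        report.append(
            {
                "fixture": name,
                "clean_changed": a != src,
                "clean_ocr_changed": b != src,
                "agree": a == b,
            }
        )
    return report


# Stand-ins for illustration only; the real croppers live in the Rust crates.
def identity_crop(text: str) -> str:
    return text


def toy_run_crop(text: str) -> str:
    # Collapse any run of 10 or more identical characters to one character.
    return re.sub(r"(.)\1{9,}", lambda m: m.group(1), text)


fixtures = {
    "valid_sum": r"$$ \sum_{i=1}^{n} \frac{1}{i^2} $$",
    "garbage_zeros": "$$ x = 0." + "0" * 40 + " $$",
}
report = compare_croppers(fixtures, identity_crop, toy_run_crop)
```

Rows with `agree == False` point at divergence between the two paths, and rows where a known-valid fixture is changed point at false positives, so the same fixture set serves both concerns.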

## References

- Math placeholder line-number caveat: #97
- Phase A parser-backed candidate (Pilot B): `rust/glossapi_rs_cleaner/docs/PHASE_A_PARSER_BACKED_INDEX.md`
- LaTeX module: `rust/glossapi_rs_cleaner/src/latex_module.rs`
- `clean_ocr` Python entry: `src/glossapi/corpus/phase_clean.py:3801`
- Math-enhance Python entry: `src/glossapi/corpus/phase_ocr_math.py`

🤖 Generated with [Claude Code](https://claude.com/claude-code)
