
corpus.clean / clean_ocr — audit LaTeX handling for math-enhancer outputs and unify cleanup ideas across the two cleaning paths #98

Description

@fffoivos

Background

There are currently TWO MD-cleaning entry points with distinct Rust backends and somewhat-independent ideas:

  1. corpus.clean: Corpus.clean() in src/glossapi/corpus/phase_clean.py, backed by rust/glossapi_rs_cleaner/ (the main cleaner). Owns LaTeX-aware repetition cropping in rust/glossapi_rs_cleaner/src/latex_module.rs (default enable_latex_repetition_crop=true after commit 8f0ce02).
  2. corpus.clean_ocr: Corpus.clean_ocr() at phase_clean.py:3801, backed by rust/glossapi_rs_noise/. Has its own page-loop + repetition / progression / digit / word-window detectors. Multiple debug variants (numeric, word, hybrid, latex_slot_progression, latex).

Meanwhile, corpus.extract --math-enhance (src/glossapi/corpus/phase_ocr_math.py) emits HTML-comment placeholders in the extracted MD; a math model OCRs the corresponding equations from the PDF and the comments are later replaced with real LaTeX (typically $$ ... $$ blocks or inline $ ... $).

Two concerns

(1) Make sure corpus.clean handles math-enhancer outputs cleanly

The math-enhancer produces real LaTeX content. corpus.clean's crop_latex_repetitions is designed for repetitive garbage OCR-LaTeX (degenerate \frac chains, repeated chars, cyclic patterns). It must NOT spuriously crop legitimate model-OCR'd math.
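To make the false-positive risk concrete, here is a minimal Python sketch of a repetition detector. It is a toy stand-in, not the logic in latex_module.rs: the tokenizer, the function names, and the thresholds (max_period, min_repeats) are all assumptions chosen for illustration.

```python
import re
from typing import List, Optional

# Hypothetical tokenizer: a LaTeX command (\frac, \int, ...) or any single
# non-whitespace character counts as one token.
TOKEN_RE = re.compile(r"\\[A-Za-z]+|\S")


def tokenize(latex: str) -> List[str]:
    return TOKEN_RE.findall(latex)


def find_repetition_cut(
    tokens: List[str], max_period: int = 8, min_repeats: int = 6
) -> Optional[int]:
    """Return the token index where a back-to-back repetition starts, or None.

    A run is flagged when a block of 1..max_period tokens repeats
    consecutively at least min_repeats times. Thresholds are illustrative.
    """
    n = len(tokens)
    for start in range(n):
        for period in range(1, max_period + 1):
            block = tokens[start:start + period]
            if len(block) < period:
                break  # not enough tokens left for this period
            repeats = 1
            pos = start + period
            while tokens[pos:pos + period] == block:
                repeats += 1
                pos += period
            if repeats >= min_repeats:
                return start
    return None


garbage = "x = " + r"\frac{1}{" * 10
integral = r"\int_0^1 x^2 \, dx = \frac{1}{3}"
sparse_row = r"\begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}"
```

With these toy thresholds the degenerate \frac chain is flagged and the integral is left alone, but the valid sparse matrix row is flagged too, because "& 0" repeats seven times. That is the kind of legitimate-but-repetitive input the audit should include.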

Action items:

  • Audit current behaviour of latex_module.rs cropping heuristics on a sample of math-enhancer outputs. Heuristics in scope:
    • detect_repeated_char_cut
    • detect_repeated_lines_cut
    • detect_repeated_element_cut
    • detect_small_vocab_run_cut
    • detect_cyclic_element_cut
    • detect_unbalanced_braces_in_latex_span
    • detect_degenerate_frac_cut
    • detect_monotonic_element_cut
  • Build a small fixture set of "real" model-OCR LaTeX (long but valid equations: integrals, summations, matrix expansions, multi-line displaystyle math) and confirm `enable_latex_repetition_crop=true` does not damage them.
  • If false positives surface, add a provenance-based exception: "if the LaTeX span came from a math-enhancer placeholder substitution, skip aggressive cropping". This needs a marker: either preserve a stable comment ID on the replaced span, or track post-substitution spans during the replacement step.
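The "track post-substitution spans" option could look roughly like the sketch below. The placeholder syntax `<!-- math:ID -->` is hypothetical (this issue does not specify what phase_ocr_math.py actually emits), and substitute_with_provenance and is_protected are illustrative names, not existing functions.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical placeholder syntax; the real comment format is an assumption.
PLACEHOLDER_RE = re.compile(r"<!--\s*math:(?P<id>[\w-]+)\s*-->")

Span = Tuple[int, int, str]  # (start offset, end offset, placeholder id)


def substitute_with_provenance(
    md: str, latex_by_id: Dict[str, str]
) -> Tuple[str, List[Span]]:
    """Replace placeholders with LaTeX and record where each one landed."""
    out: List[str] = []
    spans: List[Span] = []
    cursor = 0   # read position in the input
    out_len = 0  # number of characters written so far
    for m in PLACEHOLDER_RE.finditer(md):
        latex = latex_by_id.get(m.group("id"))
        if latex is None:
            continue  # unknown ID: leave the placeholder untouched
        prefix = md[cursor:m.start()]
        out.append(prefix)
        out_len += len(prefix)
        out.append(latex)
        spans.append((out_len, out_len + len(latex), m.group("id")))
        out_len += len(latex)
        cursor = m.end()
    out.append(md[cursor:])
    return "".join(out), spans


def is_protected(offset: int, spans: List[Span]) -> bool:
    """True if the offset falls inside a math-enhancer substitution."""
    return any(start <= offset < end for start, end, _ in spans)


md = "Energy: <!-- math:eq1 --> as shown. <!-- math:missing -->"
text, spans = substitute_with_provenance(md, {"eq1": "$$E = mc^2$$"})
```

Character offsets recorded this way are only valid until the next transformation that moves text, so a cropping pass would need to consume them before any reflow step, or the stable-comment-ID option would be needed instead.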

Currently handled (per latex_module.rs doc-comment): multi-line $$ ... $$ + single-line $$ ... $$. Deferred: inline $ ... $, \begin{env} ... \end{env} environments. The math enhancer may emit any of these forms; confirm coverage.
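A coverage check could start by counting which of the four forms actually occur in math-enhancer output. The following is a rough regex-based sketch in Python, not the span detection used in latex_module.rs; the pattern and the classify_math_spans name are assumptions, and the inline rule is deliberately naive (it would also match prose such as "costs $5 and $10").

```python
import re
from typing import List, Tuple

MATH_SPAN_RE = re.compile(
    r"""
    (?P<display>\$\$.+?\$\$)
  | (?P<env>\\begin\{(?P<name>[A-Za-z*]+)\}.+?\\end\{(?P=name)\})
  | (?P<inline>(?<![\\$])\$(?!\$)[^$\n]+?(?<!\\)\$(?!\$))
    """,
    re.VERBOSE | re.DOTALL,
)


def classify_math_spans(md: str) -> List[Tuple[str, str]]:
    """Return (form, text) for every math span found, in document order."""
    found = []
    for m in MATH_SPAN_RE.finditer(md):
        text = m.group(0)
        if m.group("display"):
            # Split display math by whether it spans several lines.
            form = "display_multiline" if "\n" in text else "display_single_line"
        elif m.group("env"):
            form = "environment"
        else:
            form = "inline"
        found.append((form, text))
    return found


sample = (
    "Inline $a+b$ here.\n"
    "$$x = 1$$\n"
    "$$\ny = 2\n$$\n"
    "\\begin{align}\nz &= 3\n\\end{align}\n"
)
forms = [form for form, _ in classify_math_spans(sample)]
```

Running this over a sample of enhanced documents would show how often the two deferred forms (inline and environments) appear, which indicates how urgent it is to extend the cropping coverage.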

(2) Bring clean_ocr up to date with corpus.clean's LaTeX ideas

clean_ocr (Rust backend glossapi_rs_noise) and clean (Rust backend glossapi_rs_cleaner) developed in parallel and don't share their LaTeX-handling code. Concretely missing on the OCR side:

  • The 8-detector cropping heuristic family from latex_module.rs (degenerate fracs, cyclic elements, monotonic runs, etc.).
  • Phase A formatting (paragraph reflow / GFM table delimiter min / HR canonicalization), which becomes relevant after Pilot B integration via PhaseAMode (see issue #97, "Phase A reflow may shift HTML-comment line numbers used by math-enhance replacement step", and the parser-backed Phase A docs).
  • The non_destructive_canonicalize shared baseline used by corpus.clean's verifier.

Action items:

  • Decide architectural split: keep clean_ocr as a separate path (different invariants for OCR'd vs Docling-extracted MD) but factor the COMMON LaTeX-cropping detectors into a shared crate / module that both glossapi_rs_cleaner and glossapi_rs_noise consume.
  • Or merge: have clean_ocr call into glossapi_rs_cleaner with an OCR-profile config rather than maintain two binaries.
  • Either way: ensure that when a corpus row passes through clean_ocr (e.g., math-enhanced rows), the LaTeX cropping detectors apply consistently with what corpus.clean would do.
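Whichever option is chosen, the consistency requirement amounts to both paths running the same detector set over a LaTeX span. The Python sketch below illustrates that shape only; the real detectors live in Rust, and CropProfile, crop_span, and toy_repeated_char_cut are hypothetical names with an illustrative threshold.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# A detector returns the offset at which to cut the span, or None.
Detector = Callable[[str], Optional[int]]


def toy_repeated_char_cut(span: str, min_run: int = 20) -> Optional[int]:
    """Toy stand-in: cut where one character repeats min_run or more times."""
    m = re.search(r"(\S)\1{%d,}" % (min_run - 1), span)
    return m.start() if m else None


# One shared detector tuple, consumed by both profiles (assumed design).
SHARED_LATEX_DETECTORS: Tuple[Detector, ...] = (toy_repeated_char_cut,)


@dataclass(frozen=True)
class CropProfile:
    name: str
    detectors: Tuple[Detector, ...]


DOCLING_PROFILE = CropProfile("docling", SHARED_LATEX_DETECTORS)
OCR_PROFILE = CropProfile("ocr", SHARED_LATEX_DETECTORS)


def crop_span(span: str, profile: CropProfile) -> str:
    """Cut the span at the earliest offset any detector reports."""
    cuts = [c for d in profile.detectors if (c := d(span)) is not None]
    return span[:min(cuts)] if cuts else span


noisy = "$$x = " + "1" * 30 + "$$"
clean = "$$x = 1$$"
```

Because both profiles reference the same tuple, a span produces the same crop regardless of entry point; path-specific behaviour (page loops, word-window detectors) would sit outside the shared set.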


🤖 Generated with Claude Code
