Background
There are currently two MD-cleaning entry points, each with its own Rust backend and somewhat independent ideas:
corpus.clean — Corpus.clean() in src/glossapi/corpus/phase_clean.py, backed by rust/glossapi_rs_cleaner/ (the main cleaner). Owns LaTeX-aware repetition cropping in rust/glossapi_rs_cleaner/src/latex_module.rs (default enable_latex_repetition_crop=true after commit 8f0ce02).
corpus.clean_ocr — Corpus.clean_ocr() at phase_clean.py:3801, backed by rust/glossapi_rs_noise/. Has its own page-loop + repetition / progression / digit / word-window detectors. Multiple debug variants (numeric, word, hybrid, latex_slot_progression, latex).
Meanwhile, corpus.extract --math-enhance (src/glossapi/corpus/phase_ocr_math.py) emits HTML-comment placeholders in the extracted MD; a math model OCRs the corresponding equations from the PDF, and the comments are later replaced with real LaTeX (typically $$ ... $$ blocks or inline $ ... $).
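The replacement step can be pictured with a minimal sketch. The placeholder syntax (`<!-- math:ID -->`), the function name `substitute_math`, and the `latex_by_id` mapping are all assumptions made for illustration; the issue does not specify the real format used by phase_ocr_math.py. The sketch also records where each substituted span lands in the output, which is one possible way to carry provenance forward to later cleaning steps.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical placeholder syntax; the real format emitted by
# phase_ocr_math.py is not specified in this issue.
PLACEHOLDER = re.compile(r"<!--\s*math:(?P<id>[A-Za-z0-9_-]+)\s*-->")


def substitute_math(
    md: str, latex_by_id: Dict[str, str]
) -> Tuple[str, List[Tuple[int, int, str]]]:
    """Replace placeholders with LaTeX and return (text, spans).

    Each span is (start, end, placeholder_id), measured in the OUTPUT text.
    Placeholders without a recognised equation are left untouched.
    """
    out: List[str] = []
    spans: List[Tuple[int, int, str]] = []
    cursor = 0  # read position in the input
    length = 0  # length of the output built so far
    for m in PLACEHOLDER.finditer(md):
        before = md[cursor:m.start()]
        out.append(before)
        length += len(before)
        latex = latex_by_id.get(m.group("id"))
        if latex is None:
            piece = m.group(0)  # keep the comment as-is
        else:
            piece = latex
            spans.append((length, length + len(piece), m.group("id")))
        out.append(piece)
        length += len(piece)
        cursor = m.end()
    out.append(md[cursor:])
    return "".join(out), spans
```

Because the spans are expressed in output coordinates, any step that rewrites the text afterwards would have to update them or re-derive them from a preserved marker.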
Two concerns
(1) Make sure corpus.clean handles math-enhancer outputs cleanly
The math-enhancer produces real LaTeX content. corpus.clean's crop_latex_repetitions is designed for repetitive garbage OCR-LaTeX (degenerate \frac chains, repeated chars, cyclic patterns). It must NOT spuriously crop legitimate model-OCR'd math.
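To make the risk concrete, here is a toy repetition check. It is not the logic in latex_module.rs: the tokenizer, the function names, and the threshold of 8 are assumptions chosen only to show how a repetition heuristic separates garbage from valid math, and where it can misfire.

```python
import re

# Toy tokenizer: a LaTeX command (\frac, \int, ...) or any single
# non-whitespace character counts as one token.
TOKEN = re.compile(r"\\[A-Za-z]+|\S")


def longest_token_run(latex: str) -> int:
    """Length of the longest run of one identical token repeated back to back."""
    best = 0
    run = 0
    prev = None
    for tok in TOKEN.findall(latex):
        run = run + 1 if tok == prev else 1
        prev = tok
        best = max(best, run)
    return best


def looks_degenerate(latex: str, max_run: int = 8) -> bool:
    """Flag a span whose longest identical-token run exceeds max_run."""
    return longest_token_run(latex) > max_run


garbage = "$$ x = " + "\\frac " * 30 + "$$"
valid = r"$$ \int_0^\infty e^{-x^2}\,dx = \frac{\sqrt{\pi}}{2} $$"
literal = "$$ N = 1000000000 $$"  # legitimate, but nine consecutive "0" tokens
```

With these inputs the check flags `garbage`, leaves `valid` alone, and also flags `literal`, which is exactly the kind of false positive on legitimate content that the audit should look for.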
Action items:
Audit current behaviour of latex_module.rs cropping heuristics on a sample of math-enhancer outputs. Heuristics in scope:
detect_repeated_char_cut
detect_repeated_lines_cut
detect_repeated_element_cut
detect_small_vocab_run_cut
detect_cyclic_element_cut
detect_unbalanced_braces_in_latex_span
detect_degenerate_frac_cut
detect_monotonic_element_cut
Build a small fixture set of "real" model-OCR LaTeX (long but valid equations: integrals, summations, matrix expansions, multi-line displaystyle math) and confirm `enable_latex_repetition_crop=true` does not damage them.
If false positives surface, add a provenance-based exception: "if the LaTeX span came from a math-enhancer placeholder substitution, skip aggressive cropping". This needs a marker: either preserve a stable comment ID on the replaced span, or track post-substitution spans during the replacement step.
Currently handled (per latex_module.rs doc-comment): multi-line $$ ... $$ + single-line $$ ... $$. Deferred: inline $ ... $, \begin{env} ... \end{env} environments. The math enhancer may emit any of these forms, so confirm coverage of each.
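One way to start the coverage check is a census of which math forms actually occur in a sample of math-enhanced MD. The sketch below is a rough, regex-based approximation under stated assumptions: the function name `census` and the category labels are made up here, environments nested inside $$ blocks are counted as display math, and currency-like text such as "$5 and $6" would be miscounted as inline math.

```python
import re
from collections import Counter

DISPLAY = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)
ENV = re.compile(r"\\begin\{([A-Za-z*]+)\}.*?\\end\{\1\}", re.DOTALL)
# Single, unescaped dollars on one line; "$$" is excluded by the lookarounds.
INLINE = re.compile(r"(?<![\\$])\$(?!\$)([^$\n]+?)(?<!\\)\$(?!\$)")


def census(md: str) -> Counter:
    """Count math spans by delimiter form (approximate)."""
    counts: Counter = Counter()

    def take_display(m: "re.Match[str]") -> str:
        kind = "display_multi_line" if "\n" in m.group(1) else "display_single_line"
        counts[kind] += 1
        return " "  # blank the span so later patterns cannot re-match it

    def take_env(m: "re.Match[str]") -> str:
        counts["environment"] += 1
        return " "

    def take_inline(m: "re.Match[str]") -> str:
        counts["inline"] += 1
        return " "

    rest = DISPLAY.sub(take_display, md)
    rest = ENV.sub(take_env, rest)
    INLINE.sub(take_inline, rest)
    return counts


sample = (
    "Text $a+b$ here.\n"
    "$$ E = mc^2 $$\n"
    "$$\n\\int_0^1 x\\,dx\n$$\n"
    "\\begin{align}\nx &= 1\n\\end{align}\n"
)
```

Running `census` over real enhancer output would show how much content falls into the two deferred forms (inline and environments) and therefore how urgent that coverage is.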
(2) Bring clean_ocr up to date with corpus.clean's LaTeX ideas
clean_ocr (Rust backend glossapi_rs_noise) and clean (Rust backend glossapi_rs_cleaner) developed in parallel and don't share their LaTeX-handling code. Concretely missing on the OCR side:
The 8-detector cropping heuristic family from latex_module.rs (degenerate fracs, cyclic elements, monotonic runs, etc.).
The non_destructive_canonicalize shared baseline used by corpus.clean's verifier.
Action items:
Decide on the architectural split. One option: keep clean_ocr as a separate path (different invariants for OCR'd vs Docling-extracted MD) but factor the COMMON LaTeX-cropping detectors into a shared crate / module that both glossapi_rs_cleaner and glossapi_rs_noise consume.
The alternative is to merge: have clean_ocr call into glossapi_rs_cleaner with an OCR-profile config rather than maintain two binaries.
Either way: ensure that when a corpus row passes through clean_ocr (e.g., math-enhanced rows), the LaTeX cropping detectors apply consistently with what corpus.clean would do.
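The shape of the shared-detector option can be sketched as follows. This is an illustration in Python rather than the Rust crates themselves, and every name in it (`toy_repeated_char_cut`, `CropProfile`, `crop_span`, the two profile constants) is hypothetical. The point is only that both entry points consume one detector list, so a given LaTeX span is cropped the same way on either path unless a profile deliberately overrides the list.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# A detector returns the offset at which to cut the span, or None for no cut.
Detector = Callable[[str], Optional[int]]


def toy_repeated_char_cut(span: str, min_run: int = 20) -> Optional[int]:
    """Toy stand-in for a real detector: cut where one non-space
    character repeats at least min_run times in a row."""
    m = re.search(r"(\S)\1{%d,}" % (min_run - 1), span)
    return m.start() if m else None


# Single source of truth consumed by both profiles.
SHARED_DETECTORS: Sequence[Detector] = (toy_repeated_char_cut,)


@dataclass(frozen=True)
class CropProfile:
    name: str
    detectors: Sequence[Detector] = SHARED_DETECTORS


def crop_span(span: str, profile: CropProfile) -> str:
    """Apply every detector in the profile and cut at the earliest offset."""
    cuts = []
    for detect in profile.detectors:
        cut = detect(span)
        if cut is not None:
            cuts.append(cut)
    return span[: min(cuts)] if cuts else span


DOCLING_PROFILE = CropProfile("docling")  # stands in for the corpus.clean path
OCR_PROFILE = CropProfile("ocr")          # stands in for the clean_ocr path
```

A differential check along these lines (same span through both profiles, outputs compared) could serve as a regression test whichever of the two architectural options is chosen.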
Also listed as missing on the OCR side: PhaseAMode (see #97, "Phase A reflow may shift HTML-comment line numbers used by math-enhance replacement step", and the parser-backed Phase A docs).
References
rust/glossapi_rs_cleaner/docs/PHASE_A_PARSER_BACKED_INDEX.md
rust/glossapi_rs_cleaner/src/latex_module.rs
src/glossapi/corpus/phase_clean.py:3801 (clean_ocr)
src/glossapi/corpus/phase_ocr_math.py
🤖 Generated with Claude Code