
feat(parse): implement OCR text extraction for image parser (#942)

Merged: MaojiaSheng merged 1 commit into volcengine:main from mvanhorn:osc/372-ocr-text-extraction on Mar 27, 2026


@mvanhorn (Contributor)

Description

Implement the _ocr_extract() method in ImageParser using pytesseract (Python binding for Tesseract OCR). The method was a stub returning None with an explicit TODO at image.py:203.

Follows the same async pattern as _asr_transcribe() in the audio parser (PR #805): wraps the synchronous pytesseract call in the event loop's run_in_executor() and degrades gracefully (returns None with a warning) when pytesseract is not installed.
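The pattern described above can be sketched roughly as follows. This is a minimal illustration, not the merged code: the free function `ocr_extract`, its `image_bytes` parameter, and the use of Pillow to decode the image are assumptions made for the example; only `pytesseract.image_to_string(image, lang=...)` and `loop.run_in_executor()` are taken as given.

```python
import asyncio
import io
import logging
from typing import Optional

logger = logging.getLogger(__name__)


async def ocr_extract(image_bytes: bytes, lang: str = "eng") -> Optional[str]:
    """Hypothetical stand-in for ImageParser._ocr_extract()."""
    try:
        # Optional dependencies: imported lazily so the parser still loads
        # when the OCR extra is not installed.
        import pytesseract
        from PIL import Image
    except ImportError:
        logger.warning("pytesseract is not installed; skipping OCR")
        return None

    def _run() -> str:
        # Synchronous, CPU-bound work that must stay off the event loop.
        image = Image.open(io.BytesIO(image_bytes))
        return pytesseract.image_to_string(image, lang=lang)

    loop = asyncio.get_running_loop()
    try:
        # None selects the loop's default thread pool executor.
        text = await loop.run_in_executor(None, _run)
    except Exception as exc:  # OCR errors should not abort ingestion
        logger.warning("OCR failed: %s", exc)
        return None

    text = text.strip()
    return text or None
```

Returning None for both the missing-dependency case and the empty-text case keeps the caller's handling uniform: it only has to check whether any text came back.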

Related Issue

Relates to #372

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test update

Changes Made

  • Replace _ocr_extract() stub in openviking/parse/parsers/media/image.py with pytesseract integration
  • Add [ocr] optional dependency group in pyproject.toml (pip install openviking[ocr])
  • Add tests in tests/parse/test_image_ocr.py covering: text extraction, empty text, missing pytesseract, exception handling, language parameter passthrough

Testing

  • 5 test cases in tests/parse/test_image_ocr.py using mocked pytesseract
  • Verified locally with Tesseract 5.5.2 on a generated test image containing "OpenViking OCR Test 2026" - text was extracted correctly
  • ruff format and ruff check pass
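For readers unfamiliar with testing against a mocked optional dependency, the following sketch shows one way the "missing pytesseract" and "mocked pytesseract" cases could be exercised without Tesseract installed. The helper `load_ocr_backend` is hypothetical and exists only to give the tests something to call; the actual tests in tests/parse/test_image_ocr.py may be structured differently.

```python
import importlib
import sys
from types import ModuleType
from typing import Optional
from unittest import mock


def load_ocr_backend() -> Optional[ModuleType]:
    """Hypothetical helper: return pytesseract if importable, else None."""
    try:
        return importlib.import_module("pytesseract")
    except ImportError:
        return None


def test_missing_pytesseract_returns_none() -> None:
    # A None entry in sys.modules makes the import raise ImportError,
    # simulating an environment without the [ocr] extra.
    with mock.patch.dict(sys.modules, {"pytesseract": None}):
        assert load_ocr_backend() is None


def test_mocked_pytesseract_is_used() -> None:
    # A fake module with a mocked image_to_string replaces the real binding.
    fake = ModuleType("pytesseract")
    fake.image_to_string = mock.Mock(return_value="OpenViking OCR Test 2026")
    with mock.patch.dict(sys.modules, {"pytesseract": fake}):
        backend = load_ocr_backend()
        assert backend is fake
        text = backend.image_to_string(object(), lang="eng")
        assert text == "OpenViking OCR Test 2026"
        fake.image_to_string.assert_called_once()
```

Using `mock.patch.dict` restores `sys.modules` when each `with` block exits, so the tests do not leak the fake module into other test files.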

Why this matters

ImageParser already has VLM description support (_vlm_describe) and config fields for OCR (enable_ocr, ocr_lang in ImageConfig), but _ocr_extract returned None. Images containing text (screenshots, documents, whiteboards) lost their textual content during ingestion. This fills the gap using the same pattern that worked for audio transcription.

Design decisions

  • pytesseract over PaddleOCR: lighter dependency (no torch). Chinese text works via chi_sim lang pack in Tesseract.
  • Optional dependency: pytesseract is not added to core deps. ImportError returns None with a warning, matching how the codebase handles optional providers.
  • No new config fields: reuses existing enable_ocr and ocr_lang from ImageConfig.
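Because Chinese support depends on the chi_sim language pack being present on the host, a caller might want to check the configured ocr_lang before running OCR. The sketch below is an assumption-laden illustration, not part of this PR: `missing_lang_packs` is a hypothetical helper, and it relies only on Tesseract's convention of joining multiple language codes with `+` (for example `eng+chi_sim`).

```python
from typing import Iterable, List


def missing_lang_packs(ocr_lang: str, installed: Iterable[str]) -> List[str]:
    """Hypothetical helper: list requested language codes that are not installed."""
    available = set(installed)
    # Tesseract accepts several languages joined with "+", e.g. "eng+chi_sim".
    requested = [code for code in ocr_lang.split("+") if code]
    return [code for code in requested if code not in available]


# Example: chi_sim was requested but only eng and osd are installed.
print(missing_lang_packs("eng+chi_sim", ["eng", "osd"]))
```

The `installed` argument is passed in explicitly so the helper stays testable without Tesseract; in practice the list could come from the Tesseract installation itself (for instance `tesseract --list-langs`).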

This contribution was developed with AI assistance (Claude Code).

Replace the _ocr_extract() stub with a working Tesseract integration
via pytesseract. Uses loop.run_in_executor() for the synchronous
pytesseract call, matching the pattern from _asr_transcribe() in the
audio parser (PR volcengine#805).

Gracefully degrades when pytesseract is not installed by returning None
with a warning. Added as optional dependency: pip install openviking[ocr]

Relates to volcengine#372

@github-actions

Failed to generate code suggestions for PR

@mvanhorn (Contributor, Author)

The build distribution CI failures (No module named pip in the isolated build env) appear to be a pre-existing infrastructure issue, not related to the changes in this PR.

The ocr optional dependency group I added to pyproject.toml triggers the check-deps gate (which normally skips the build matrix). The failure happens during python -m build --wheel bootstrapping - before the build tool even reads the new extras. The Linux container build (ubuntu:20.04) loses pip accessibility after the git reset --hard && git clean -fd workspace clean step.

lint, test-lite, check-deps, and CLA all pass. Are the build distribution jobs expected to pass for PRs that touch pyproject.toml, or is this a known issue?

@MaojiaSheng MaojiaSheng merged commit 485e6df into volcengine:main Mar 27, 2026
5 of 11 checks passed
@github-project-automation github-project-automation bot moved this from Backlog to Done in OpenViking project Mar 27, 2026