feat(parse): implement OCR text extraction for image parser by mvanhorn · Pull Request #942 · volcengine/OpenViking

mvanhorn · 2026-03-25T00:18:07Z

Description

Implement the _ocr_extract() method in ImageParser using pytesseract (Python binding for Tesseract OCR). The method was a stub returning None with an explicit TODO at image.py:203.

Follows the same async pattern as _asr_transcribe() in the audio parser (PR #805): wraps the synchronous pytesseract call in the event loop's run_in_executor() and degrades gracefully when pytesseract is not installed.
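For readers unfamiliar with the pattern, here is a minimal sketch of what such a method might look like. It is not the merged code: the free function `ocr_extract`, its `image_bytes` parameter, and the log messages are assumptions for illustration. Only `pytesseract.image_to_string()` and `loop.run_in_executor()` are real library calls.

```python
import asyncio
import io
import logging
from typing import Optional

logger = logging.getLogger(__name__)


async def ocr_extract(image_bytes: bytes, lang: str = "eng") -> Optional[str]:
    """Hypothetical stand-in for ImageParser._ocr_extract()."""
    try:
        # Imported lazily so the core package works without the [ocr] extra.
        import pytesseract
        from PIL import Image
    except ImportError:
        logger.warning("pytesseract is not installed; skipping OCR")
        return None

    def _run() -> str:
        # Blocking work: decode the image and invoke Tesseract.
        with Image.open(io.BytesIO(image_bytes)) as img:
            return pytesseract.image_to_string(img, lang=lang)

    try:
        # Run the blocking call in the default thread pool so the
        # event loop stays responsive.
        loop = asyncio.get_running_loop()
        text = await loop.run_in_executor(None, _run)
    except Exception as exc:
        logger.warning("OCR failed: %s", exc)
        return None

    # Treat whitespace-only output as "no text found".
    text = text.strip()
    return text or None
```

In this sketch, every failure path (missing dependency, unreadable image, Tesseract error) returns None, so a caller can treat OCR as strictly best-effort.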

Related Issue

Relates to #372

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Refactoring (no functional changes)
Performance improvement
Test update

Changes Made

Replace _ocr_extract() stub in openviking/parse/parsers/media/image.py with pytesseract integration
Add [ocr] optional dependency group in pyproject.toml (pip install openviking[ocr])
Add tests in tests/parse/test_image_ocr.py covering: text extraction, empty text, missing pytesseract, exception handling, language parameter passthrough

Testing

5 test cases in tests/parse/test_image_ocr.py using mocked pytesseract
Verified locally with Tesseract 5.5.2 on a generated test image containing "OpenViking OCR Test 2026" - text was extracted correctly
ruff format and ruff check pass
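As an illustration of how pytesseract can be mocked without Tesseract installed, the sketch below covers two of the listed cases (language passthrough and missing pytesseract). It does not reproduce tests/parse/test_image_ocr.py: `extract_text` is a simplified, synchronous, hypothetical stand-in for the real method, and the helper names are assumptions.

```python
import sys
from types import SimpleNamespace
from unittest import mock


def extract_text(image, lang="eng"):
    """Hypothetical, simplified synchronous stand-in for the OCR call."""
    try:
        import pytesseract  # resolved through sys.modules, so it can be mocked
    except ImportError:
        return None
    text = pytesseract.image_to_string(image, lang=lang).strip()
    return text or None


def check_lang_passthrough():
    # Fake module exposing only the function the code under test calls.
    fake = SimpleNamespace(image_to_string=mock.Mock(return_value=" hello \n"))
    with mock.patch.dict(sys.modules, {"pytesseract": fake}):
        result = extract_text("img", lang="chi_sim")
    fake.image_to_string.assert_called_once_with("img", lang="chi_sim")
    return result


def check_missing_dependency():
    # A None entry in sys.modules makes `import pytesseract` raise ImportError.
    with mock.patch.dict(sys.modules, {"pytesseract": None}):
        return extract_text("img")
```

Patching `sys.modules` rather than the function under test keeps the lazy import on the tested code path, which is the part that has to degrade gracefully.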

Why this matters

ImageParser already has VLM description support (_vlm_describe) and config fields for OCR (enable_ocr, ocr_lang in ImageConfig), but _ocr_extract returned None. Images containing text (screenshots, documents, whiteboards) lost their textual content during ingestion. This fills the gap using the same pattern that worked for audio transcription.

Design decisions

pytesseract over PaddleOCR: lighter dependency (no torch). Chinese text works via chi_sim lang pack in Tesseract.
Optional dependency: pytesseract is not added to core deps. ImportError returns None with a warning, matching how the codebase handles optional providers.
No new config fields: reuses existing enable_ocr and ocr_lang from ImageConfig.
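A rough sketch of how the two existing fields could gate OCR is shown below. The field names enable_ocr and ocr_lang come from the description above. The defaults, the standalone dataclass, and the helper `ocr_lang_for` are assumptions, not the project's actual ImageConfig.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ImageConfig:
    # Field names come from the PR description; the defaults are assumptions.
    enable_ocr: bool = False
    ocr_lang: str = "eng"


def ocr_lang_for(config: ImageConfig) -> Optional[str]:
    """Return the Tesseract language string to use, or None if OCR is off."""
    if not config.enable_ocr:
        return None
    return config.ocr_lang
```

Tesseract accepts several language packs joined with `+`, so a value such as `"eng+chi_sim"` would request English and Simplified Chinese together, provided both packs are installed.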

This contribution was developed with AI assistance (Claude Code).

Replace the _ocr_extract() stub with a working Tesseract integration via pytesseract. Uses the event loop's run_in_executor() for the synchronous pytesseract call, matching the pattern from _asr_transcribe() in the audio parser (PR volcengine#805). Gracefully degrades when pytesseract is not installed by returning None with a warning.

Added as optional dependency: pip install openviking[ocr]

Relates to volcengine#372

github-actions · 2026-03-25T00:18:51Z

Failed to generate code suggestions for PR

mvanhorn · 2026-03-25T01:07:45Z

The build distribution CI failures (No module named pip in the isolated build env) appear to be a pre-existing infrastructure issue, not related to the changes in this PR.

The ocr optional dependency group I added to pyproject.toml triggers the check-deps gate (which normally skips the build matrix). The failure happens while python -m build --wheel is bootstrapping, before the build tool even reads the new extras. The Linux container build (ubuntu:20.04) loses access to pip after the git reset --hard && git clean -fd workspace clean step.

lint, test-lite, check-deps, and CLA all pass. Are the build distribution jobs expected to pass for PRs that touch pyproject.toml, or is this a known issue?

github-project-automation bot added this to OpenViking project Mar 25, 2026

github-project-automation bot moved this to Backlog in OpenViking project Mar 25, 2026

mvanhorn mentioned this pull request Mar 25, 2026

feat(parse): implement video key frame extraction with metadata #943

Open

7 tasks

MaojiaSheng approved these changes Mar 27, 2026


MaojiaSheng merged commit 485e6df into volcengine:main Mar 27, 2026
5 of 11 checks passed

github-project-automation bot moved this from Backlog to Done in OpenViking project Mar 27, 2026

MaojiaSheng merged 1 commit into volcengine:main from mvanhorn:osc/372-ocr-text-extraction
