feat(parse): implement OCR text extraction for image parser #942
Merged
MaojiaSheng merged 1 commit into volcengine:main on Mar 27, 2026
Replace the `_ocr_extract()` stub with a working Tesseract integration via pytesseract. The synchronous pytesseract call runs through asyncio's `loop.run_in_executor()`, matching the pattern from `_asr_transcribe()` in the audio parser (PR volcengine#805). When pytesseract is not installed, the method degrades gracefully by returning `None` with a warning. OCR is added as an optional dependency: `pip install openviking[ocr]`.
Relates to volcengine#372
Contributor
Author
The build distribution CI failures ([…]) […] The lint, test-lite, check-deps, and CLA all pass. Are the build distribution jobs expected to pass for PRs that touch […]?
MaojiaSheng approved these changes on Mar 27, 2026
Description
Implement the `_ocr_extract()` method in `ImageParser` using pytesseract (Python binding for Tesseract OCR). The method was a stub returning `None` with an explicit TODO at `image.py:203`.
Follows the same async pattern as `_asr_transcribe()` in the audio parser (PR #805): wraps the synchronous pytesseract call in asyncio's `loop.run_in_executor()` and degrades gracefully when pytesseract is not installed.
Related Issue
Relates to #372
Type of Change
Changes Made
- Replace the `_ocr_extract()` stub in `openviking/parse/parsers/media/image.py` with pytesseract integration
- Add an `[ocr]` optional dependency group in `pyproject.toml` (`pip install openviking[ocr]`)
- Add tests in `tests/parse/test_image_ocr.py` covering: text extraction, empty text, missing pytesseract, exception handling, language parameter passthrough
Testing
- Tests in `tests/parse/test_image_ocr.py` using mocked pytesseract
- `ruff format` and `ruff check` pass
Why this matters
`ImageParser` already has VLM description support (`_vlm_describe`) and config fields for OCR (`enable_ocr`, `ocr_lang` in `ImageConfig`), but `_ocr_extract` returned `None`. Images containing text (screenshots, documents, whiteboards) lost their textual content during ingestion. This fills the gap using the same pattern that worked for audio transcription.
Design decisions
- […] `chi_sim` lang pack in Tesseract.
- `ImportError` returns `None` with a warning, matching how the codebase handles optional providers.
- […] `enable_ocr` and `ocr_lang` from `ImageConfig`.
This contribution was developed with AI assistance (Claude Code).
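For readers skimming the PR, the approach described above (lazy import, blocking call moved to an executor, `None` on any failure) might look roughly like the sketch below. This is not the merged code: the standalone `ocr_extract` function, the `ImageConfig` stand-in and its defaults, and the choice to return `None` for whitespace-only output are all assumptions made for illustration. Only `pytesseract.image_to_string`, `PIL.Image.open`, and `loop.run_in_executor` are real library calls.

```python
import asyncio
import io
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger(__name__)


@dataclass
class ImageConfig:
    # Hypothetical stand-in: field names come from the PR text, defaults are assumed.
    enable_ocr: bool = True
    ocr_lang: str = "eng"


async def ocr_extract(image_bytes: bytes, config: ImageConfig) -> Optional[str]:
    """Return OCR text, or None if OCR is disabled, unavailable, or fails."""
    if not config.enable_ocr:
        return None
    try:
        # Optional dependencies, imported lazily so parsing works without them.
        import pytesseract
        from PIL import Image
    except ImportError:
        logger.warning(
            "pytesseract is not installed; skipping OCR "
            "(pip install openviking[ocr])"
        )
        return None

    def _run_tesseract() -> str:
        # Blocking work: decode the image and hand it to Tesseract.
        image = Image.open(io.BytesIO(image_bytes))
        return pytesseract.image_to_string(image, lang=config.ocr_lang)

    loop = asyncio.get_running_loop()
    try:
        # Default thread pool keeps the event loop responsive during OCR.
        text = await loop.run_in_executor(None, _run_tesseract)
    except Exception as exc:
        logger.warning("OCR extraction failed: %s", exc)
        return None
    # Assumed behavior: whitespace-only output counts as "no text".
    return text.strip() or None
```

A call such as `asyncio.run(ocr_extract(data, ImageConfig(ocr_lang="chi_sim")))` would additionally need the Tesseract binary and the matching language pack installed on the host, since pytesseract only wraps the external executable.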