
Conversation


@HatmanStack HatmanStack commented Sep 23, 2025

Text Extraction Enhancement

What Changed

Added intelligent PDF routing to avoid unnecessary OCR costs for text-native PDFs.

New Components

TextExtractionService

  • Located in idp_common/text_extraction
  • Uses PyMuPDF to inspect PDFs and detect if they contain extractable text
  • Extracts text directly from digital PDFs without OCR

Enhanced OcrService

  • Now checks if PDFs are text-native before sending to OCR
  • Routes text-native PDFs to direct extraction
  • Sends scanned PDFs and images to OCR as before
  • Returns same Document object format regardless of path

Processing Flow

PDF Upload → PDF Inspection → Text-native?
                                  ↓
                              Yes → Direct text extraction
                              No  → OCR (existing flow)
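
The routing decision above could be sketched as follows. This is a hypothetical illustration, not the actual `idp_common` API: in the real service the per-page text would come from PyMuPDF (`page.get_text()`), while here pages are passed in as plain strings, and the 80% page threshold and 50-character minimum are assumed values.

```python
def is_text_native(page_texts, min_chars_per_page=50):
    """Treat a PDF as text-native when most pages carry real text.

    A scanned PDF yields empty (or near-empty) text layers, so a low
    character count per page signals that OCR is required.
    """
    if not page_texts:
        return False
    pages_with_text = sum(
        1 for t in page_texts if len(t.strip()) >= min_chars_per_page
    )
    return pages_with_text / len(page_texts) >= 0.8  # threshold is an assumption


def route(page_texts):
    """Return the processing path the flow above would choose."""
    return "direct_extraction" if is_text_native(page_texts) else "ocr"


# A digital PDF with real text on every page goes to direct extraction:
print(route(["Lorem ipsum dolor sit amet " * 5]))   # direct_extraction
# A scanned PDF (empty text layers) falls through to OCR:
print(route(["", " "]))                              # ocr
```

Either way, the caller receives the same `Document` object shape, so downstream steps don't need to know which path ran.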

Benefits

  • Cost Reduction: No Textract calls for digital PDFs
  • Faster Processing: Direct text extraction is much faster than OCR
  • Same Output: Document object structure unchanged

Tasks

  • Test processing with diverse data
  • Filter out non-relevant images from processing pipeline
  • Include processed image metrics in reporting dashboard

google-labs-jules bot and others added 5 commits September 23, 2025 00:22
This commit introduces a new `TextExtractionService` to intelligently handle PDF processing, reducing unnecessary calls to expensive OCR services.

The key changes are:
- A new `TextExtractionService` is created in `idp_common/text_extraction`. This service uses PyMuPDF to inspect PDFs and determine if they are text-native or scanned.
- The main `OcrService` is refactored to use this new service. It now routes text-native PDFs to a direct text extraction path, bypassing OCR, while scanned PDFs and images are processed by the configured OCR backend as before.
- This change maintains backward compatibility by ensuring the output `Document` object has a consistent structure regardless of the processing path.
- Comprehensive unit tests have been added for the new service and the new routing logic within the `OcrService`.
- Documentation has been updated to reflect the new, more efficient text extraction flow.

1. **PDF Inspection**: When a PDF document is processed, it's first inspected by the `TextExtractionService`.
2. **Content-based Routing**:
- **Text-Native PDFs**: If the service detects a significant amount of selectable text, it extracts the text directly using `PyMuPDF`. This bypasses the OCR backend (Textract or Bedrock) entirely. All required document artifacts (page images, parsed text) are still generated for compatibility with the rest of the IDP pipeline.
Contributor

What happens if it's a mixture - i.e. a Text PDF with embedded images that are important to OCR and extract?

Author

You are right on this; will revisit.

Contributor

Any update on this?

Contributor

Does this need to be a new class, or can we keep it neatly nested and abstracted inside the existing OCR class?

Author

I was thinking about expansion to other text formats. Right now we have: ["txt", "csv", "xlsx", "docx"]. I'd like to expand to something more comprehensive: [".docx", ".txt", ".rtf", ".odt", ".doc", ".pages", ".html", ".htm", ".xml", ".md", ".markdown", ".json", ".yaml", ".yml", ".csv", ".tsv", ".rst", ".asciidoc", ".adoc", ".epub", ".mobi", ".tex", ".py", ".js", ".java", ".sql", ".ini", ".conf", ".config", ".log", ".env", ".xlsx"]. I haven't yet done the work to understand what each format will take, so I thought a new module might make sense for the future of the feature?

Contributor

I'd rather keep it all abstracted behind the OCR class. I feel this is a feature of OCR, not a new top-level class.

@rstrahan
Contributor

What's the latest status on this @HatmanStack - are you still tinkering with it or do you feel it's solid and ready to merge? Maybe add some notes on your (new feature + regression) testing to provide confidence. Tx!

@HatmanStack
Author

Going to take time to properly test and feel confident about not losing context with the hybrid text extraction. Would also like to add some logic to filter non-pertinent images, logos, etc. Maybe by the weekend.

google-labs-jules bot and others added 5 commits September 30, 2025 23:13
This change introduces a new, efficient processing path for text-based PDFs within the document processing pipeline.

The `text_extraction` service now generates a structured manifest of content blocks (text and images) from text-based PDFs. For image blocks, it performs OCR and includes the OCR'd text, raw image data, and S3 destination key in the manifest.

The `ocr` service has been updated to call the `text_extraction` service for text-based PDFs. It then processes the returned manifest, saving images to S3 and compiling the final text data.

This new workflow improves the handling of text-native PDFs with embedded images, ensuring that all content is accurately extracted and processed.
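
The manifest handling this commit describes could be sketched roughly as below. The block shapes, key names, and the `save_image`/`run_ocr` helpers are illustrative assumptions, not the actual `idp_common` interfaces; the point is that image blocks are OCR'd and persisted in document order, so no context is lost between surrounding text blocks.

```python
def compile_manifest(blocks, save_image, run_ocr):
    """Walk the manifest in order, OCR image blocks, and return the final text."""
    parts = []
    for block in blocks:
        if block["type"] == "text":
            parts.append(block["content"])
        elif block["type"] == "image":
            save_image(block["s3_key"], block["data"])   # persist raw image to S3
            parts.append(run_ocr(block["data"]))         # splice OCR text into place
    return "\n".join(parts)


# Stubbed storage and OCR backends for the sketch:
saved = {}
manifest = [
    {"type": "text", "content": "Intro paragraph."},
    {"type": "image", "data": b"\x89PNG...", "s3_key": "doc/page1/img0.png"},
    {"type": "text", "content": "Closing paragraph."},
]
result = compile_manifest(
    manifest,
    save_image=lambda key, data: saved.setdefault(key, data),
    run_ocr=lambda data: "[OCR text from figure]",
)
print(result)
```

Because the OCR'd image text is interleaved at its original position, the compiled output preserves reading order for the extraction model.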
@HatmanStack
Author

@rstrahan This is ready for review. Testing showed ~80% reduction in token usage when using Bedrock for extraction with the hybrid method, and reduced costs to $0 for native text docs that were previously processed by Textract. This should deliver significant cost savings for users.

Changes:

  • Implemented hybrid PDF extraction that auto-detects text-native vs. scanned documents
  • Directly extracts machine-readable text while simultaneously OCRing embedded images/formulas
  • Intelligent image filtering (size/position/aspect ratio) processes only content-relevant visuals, excluding decorative elements
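
A minimal sketch of the size/position/aspect-ratio filter, with illustrative thresholds (the 1% area floor, 8:1 aspect limit, and 5% header/footer margins are assumptions, not the merged values):

```python
def is_content_image(width, height, y_top, page_width, page_height):
    """Reject tiny decorations, extreme banners, and header/footer art."""
    area_ratio = (width * height) / (page_width * page_height)
    if area_ratio < 0.01:                # too small: icons, bullets, small logos
        return False
    aspect = width / height
    if aspect > 8 or aspect < 1 / 8:     # extreme aspect: rules, banner strips
        return False
    # Images pinned to the top/bottom 5% of the page: likely header/footer art
    if y_top < 0.05 * page_height or y_top > 0.95 * page_height:
        return False
    return True


# Figure-sized image in the body of a US Letter page (612x792 pt):
print(is_content_image(300, 200, 300, 612, 792))  # True
print(is_content_image(16, 16, 300, 612, 792))    # False (icon-sized)
```

Only images passing this check would be sent through the OCR backend, which is where the token savings come from.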

Additional Updates:

  • Expanded structured content extraction: tables, images, formulas, form fields, hyperlinks, annotations, metadata
  • Increased format support from 4 to 50+ file types (TSV, RTF, ODT, EPUB, Markdown, HTML, code files)
  • Uses core Python libraries for most formats with optional library support for enhanced text extraction; gracefully falls back to OCR when optional dependencies aren't installed
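
The dispatch-with-fallback pattern could look roughly like this. Handler choices and the `None` fallback sentinel are assumptions for illustration: the stdlib covers the simple formats, an optional library (here `python-docx`) covers richer ones, and anything unhandled signals the caller to fall back to OCR.

```python
import csv
import io
import json


def extract_text(filename, data: bytes):
    """Return extracted text, or None to signal an OCR fallback."""
    ext = filename.rsplit(".", 1)[-1].lower()
    if ext in ("txt", "md", "log", "py"):
        return data.decode("utf-8", errors="replace")
    if ext in ("csv", "tsv"):
        delim = "\t" if ext == "tsv" else ","
        rows = csv.reader(io.StringIO(data.decode("utf-8")), delimiter=delim)
        return "\n".join(" ".join(row) for row in rows)
    if ext == "json":
        return json.dumps(json.loads(data), indent=2)
    if ext == "docx":
        try:
            import docx  # optional dependency: python-docx
        except ImportError:
            return None  # optional library missing: fall back to OCR
        document = docx.Document(io.BytesIO(data))
        return "\n".join(p.text for p in document.paragraphs)
    return None  # unknown format: fall back to OCR


print(extract_text("notes.txt", b"hello world"))  # hello world
print(extract_text("data.csv", b"a,b\n1,2"))      # a b / 1 2
```

Keeping the common formats on the stdlib path means the 50+ file-type support adds no mandatory dependencies.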

Documentation:

  • Added details on the changes and optional library installation for extended format support
