
Conversation


@HatmanStack HatmanStack commented Sep 23, 2025

Text Extraction Enhancement

What Changed

Added intelligent PDF routing to avoid unnecessary OCR costs for text-native PDFs.

New Components

TextExtractionService

  • Located in idp_common/text_extraction
  • Uses PyMuPDF to inspect PDFs and detect if they contain extractable text
  • Extracts text directly from digital PDFs without OCR

Enhanced OcrService

  • Now checks if PDFs are text-native before sending to OCR
  • Routes text-native PDFs to direct extraction
  • Sends scanned PDFs and images to OCR as before
  • Returns same Document object format regardless of path

Processing Flow

PDF Upload → PDF Inspection → Text-native?
                                  ↓
                              Yes → Direct text extraction
                              No  → OCR (existing flow)
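
The routing decision above could be sketched as follows. This is a hypothetical illustration, not the actual `idp_common` API: in the real service the per-page text would come from PyMuPDF (`page.get_text()`), while here pages are passed in as plain strings, and the 80% page threshold and 50-character minimum are assumed values.

```python
def is_text_native(page_texts, min_chars_per_page=50):
    """Treat a PDF as text-native when most pages carry real text.

    A scanned PDF yields empty (or near-empty) text layers, so a low
    character count per page signals that OCR is required.
    """
    if not page_texts:
        return False
    pages_with_text = sum(
        1 for t in page_texts if len(t.strip()) >= min_chars_per_page
    )
    return pages_with_text / len(page_texts) >= 0.8  # threshold is an assumption


def route(page_texts):
    """Return the processing path the flow above would choose."""
    return "direct_extraction" if is_text_native(page_texts) else "ocr"


# A digital PDF with real text on every page goes to direct extraction:
print(route(["Lorem ipsum dolor sit amet " * 5]))   # direct_extraction
# A scanned PDF (empty text layers) falls through to OCR:
print(route(["", " "]))                              # ocr
```

Either way, the caller receives the same `Document` object shape, so downstream steps don't need to know which path ran.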

Benefits

  • Cost Reduction: No Textract calls for digital PDFs
  • Faster Processing: Direct text extraction is much faster than OCR
  • Same Output: Document object structure unchanged

Tasks

  • Test processing with diverse data
  • Filter out non-relevant images from processing pipeline
  • Include processed image metrics in reporting dashboard

google-labs-jules bot and others added 5 commits September 23, 2025 00:22
This commit introduces a new `TextExtractionService` to intelligently handle PDF processing, reducing unnecessary calls to expensive OCR services.

The key changes are:
- A new `TextExtractionService` is created in `idp_common/text_extraction`. This service uses PyMuPDF to inspect PDFs and determine if they are text-native or scanned.
- The main `OcrService` is refactored to use this new service. It now routes text-native PDFs to a direct text extraction path, bypassing OCR, while scanned PDFs and images are processed by the configured OCR backend as before.
- This change maintains backward compatibility by ensuring the output `Document` object has a consistent structure regardless of the processing path.
- Comprehensive unit tests have been added for the new service and the new routing logic within the `OcrService`.
- Documentation has been updated to reflect the new, more efficient text extraction flow.

1. **PDF Inspection**: When a PDF document is processed, it's first inspected by the `TextExtractionService`.
2. **Content-based Routing**:
- **Text-Native PDFs**: If the service detects a significant amount of selectable text, it extracts the text directly using `PyMuPDF`. This bypasses the OCR backend (Textract or Bedrock) entirely. All required document artifacts (page images, parsed text) are still generated for compatibility with the rest of the IDP pipeline.
Contributor

What happens if it's a mixture - i.e. a Text PDF with embedded images that are important to OCR and extract?

Author

You are right on this; will revisit.

Contributor

Any update on this?

Contributor

Does this need to be a new class, or can we keep it neatly nested and abstracted inside the existing OCR class?

Author

I was thinking about expansion to other text formats. Right now we have: ["txt", "csv", "xlsx", "docx"]. I'd like to expand to something more comprehensive: [".docx", ".txt", ".rtf", ".odt", ".doc", ".pages", ".html", ".htm", ".xml", ".md", ".markdown", ".json", ".yaml", ".yml", ".csv", ".tsv", ".rst", ".asciidoc", ".adoc", ".epub", ".mobi", ".tex", ".py", ".js", ".java", ".sql", ".ini", ".conf", ".config", ".log", ".env", ".xlsx"]. I haven't yet done the work to understand what each format will take, so I thought a new module might make sense for the future of the feature?

Contributor

I'd rather keep it all abstracted behind the OCR class. I feel this is a feature of OCR, not a new top-level class.

@rstrahan
Contributor

What's the latest status on this @HatmanStack - are you still tinkering with it or do you feel it's solid and ready to merge? Maybe add some notes on your (new feature + regression) testing to provide confidence. Tx!

@HatmanStack
Author

Going to take time to properly test and feel confident about not losing context with the hybrid text extraction. Would also like to add some logic to filter non-pertinent images, logos, etc. Maybe by the weekend.

google-labs-jules bot and others added 5 commits September 30, 2025 23:13
This change introduces a new, efficient processing path for text-based PDFs within the document processing pipeline.

The `text_extraction` service now generates a structured manifest of content blocks (text and images) from text-based PDFs. For image blocks, it performs OCR and includes the OCR'd text, raw image data, and S3 destination key in the manifest.

The `ocr` service has been updated to call the `text_extraction` service for text-based PDFs. It then processes the returned manifest, saving images to S3 and compiling the final text data.

This new workflow improves the handling of text-native PDFs with embedded images, ensuring that all content is accurately extracted and processed.
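
The manifest handling this commit describes could be sketched roughly as below. The block shapes, key names, and the `save_image`/`run_ocr` helpers are illustrative assumptions, not the actual `idp_common` interfaces; the point is that image blocks are OCR'd and persisted in document order, so no context is lost between surrounding text blocks.

```python
def compile_manifest(blocks, save_image, run_ocr):
    """Walk the manifest in order, OCR image blocks, and return the final text."""
    parts = []
    for block in blocks:
        if block["type"] == "text":
            parts.append(block["content"])
        elif block["type"] == "image":
            save_image(block["s3_key"], block["data"])   # persist raw image to S3
            parts.append(run_ocr(block["data"]))         # splice OCR text into place
    return "\n".join(parts)


# Stubbed storage and OCR backends for the sketch:
saved = {}
manifest = [
    {"type": "text", "content": "Intro paragraph."},
    {"type": "image", "data": b"\x89PNG...", "s3_key": "doc/page1/img0.png"},
    {"type": "text", "content": "Closing paragraph."},
]
result = compile_manifest(
    manifest,
    save_image=lambda key, data: saved.setdefault(key, data),
    run_ocr=lambda data: "[OCR text from figure]",
)
print(result)
```

Because the OCR'd image text is interleaved at its original position, the compiled output preserves reading order for the extraction model.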
@HatmanStack
Author

@rstrahan This is ready for review. Testing showed ~80% reduction in token usage when using Bedrock for extraction with the hybrid method, and reduced costs to $0 for native text docs that were previously processed by Textract. This should deliver significant cost savings for users.

Changes:

  • Implemented hybrid PDF extraction that auto-detects text-native vs. scanned documents
  • Directly extracts machine-readable text while simultaneously OCRing embedded images/formulas
  • Intelligent image filtering (size/position/aspect ratio) processes only content-relevant visuals, excluding decorative elements
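
A minimal sketch of the size/position/aspect-ratio filter, with illustrative thresholds (the 1% area floor, 8:1 aspect limit, and 5% header/footer margins are assumptions, not the merged values):

```python
def is_content_image(width, height, y_top, page_width, page_height):
    """Reject tiny decorations, extreme banners, and header/footer art."""
    area_ratio = (width * height) / (page_width * page_height)
    if area_ratio < 0.01:                # too small: icons, bullets, small logos
        return False
    aspect = width / height
    if aspect > 8 or aspect < 1 / 8:     # extreme aspect: rules, banner strips
        return False
    # Images pinned to the top/bottom 5% of the page: likely header/footer art
    if y_top < 0.05 * page_height or y_top > 0.95 * page_height:
        return False
    return True


# Figure-sized image in the body of a US Letter page (612x792 pt):
print(is_content_image(300, 200, 300, 612, 792))  # True
print(is_content_image(16, 16, 300, 612, 792))    # False (icon-sized)
```

Only images passing this check would be sent through the OCR backend, which is where the token savings come from.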

Additional Updates:

  • Expanded structured content extraction: tables, images, formulas, form fields, hyperlinks, annotations, metadata
  • Increased format support from 4 to 50+ file types (TSV, RTF, ODT, EPUB, Markdown, HTML, code files)
  • Uses core Python libraries for most formats with optional library support for enhanced text extraction; gracefully falls back to OCR when optional dependencies aren't installed
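
The dispatch-with-fallback pattern could look roughly like this. Handler choices and the `None` fallback sentinel are assumptions for illustration: the stdlib covers the simple formats, an optional library (here `python-docx`) covers richer ones, and anything unhandled signals the caller to fall back to OCR.

```python
import csv
import io
import json


def extract_text(filename, data: bytes):
    """Return extracted text, or None to signal an OCR fallback."""
    ext = filename.rsplit(".", 1)[-1].lower()
    if ext in ("txt", "md", "log", "py"):
        return data.decode("utf-8", errors="replace")
    if ext in ("csv", "tsv"):
        delim = "\t" if ext == "tsv" else ","
        rows = csv.reader(io.StringIO(data.decode("utf-8")), delimiter=delim)
        return "\n".join(" ".join(row) for row in rows)
    if ext == "json":
        return json.dumps(json.loads(data), indent=2)
    if ext == "docx":
        try:
            import docx  # optional dependency: python-docx
        except ImportError:
            return None  # optional library missing: fall back to OCR
        document = docx.Document(io.BytesIO(data))
        return "\n".join(p.text for p in document.paragraphs)
    return None  # unknown format: fall back to OCR


print(extract_text("notes.txt", b"hello world"))  # hello world
print(extract_text("data.csv", b"a,b\n1,2"))      # a b / 1 2
```

Keeping the common formats on the stdlib path means the 50+ file-type support adds no mandatory dependencies.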

Documentation:

  • Added details on the changes and optional library installation for extended format support
