Feature/expanded text ingestion #62
base: main
Conversation
This commit introduces a new `TextExtractionService` to intelligently handle PDF processing, reducing unnecessary calls to expensive OCR services. The key changes are:

- A new `TextExtractionService` is created in `idp_common/text_extraction`. This service uses PyMuPDF to inspect PDFs and determine whether they are text-native or scanned.
- The main `OcrService` is refactored to use this new service. It now routes text-native PDFs to a direct text extraction path, bypassing OCR, while scanned PDFs and images are processed by the configured OCR backend as before.
- This change maintains backward compatibility by ensuring the output `Document` object has a consistent structure regardless of the processing path.
- Comprehensive unit tests have been added for the new service and the new routing logic within the `OcrService`.
- Documentation has been updated to reflect the new, more efficient text extraction flow.
… BKB is pulling vectors from
1. **PDF Inspection**: When a PDF document is processed, it's first inspected by the `TextExtractionService`.
2. **Content-based Routing**:
   - **Text-Native PDFs**: If the service detects a significant amount of selectable text, it extracts the text directly using `PyMuPDF`. This bypasses the OCR backend (Textract or Bedrock) entirely. All required document artifacts (page images, parsed text) are still generated for compatibility with the rest of the IDP pipeline.
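The routing described above can be sketched as a simple dispatch inside the OCR service. The class and method names here are hypothetical, used only to show the shape of the control flow, not the actual `OcrService` interface.

```python
# Hypothetical routing sketch; names are illustrative, not the real API.
class OcrService:
    def __init__(self, backend, text_extractor):
        self.backend = backend            # e.g. a Textract or Bedrock client
        self.text_extractor = text_extractor

    def process(self, pdf_path: str) -> dict:
        if self.text_extractor.is_text_native(pdf_path):
            # Text-native: extract directly and skip the OCR backend.
            return self.text_extractor.extract(pdf_path)
        # Scanned or image-heavy: fall back to the configured OCR backend.
        return self.backend.ocr(pdf_path)
```

Because both branches return the same document shape, callers downstream in the pipeline don't need to know which path was taken.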
What happens if it's a mixture - i.e. a Text PDF with embedded images that are important to OCR and extract?
You are right on this. Will revisit.
Any update on this?
Does this need to be a new class, or can we keep it neatly nested and abstracted inside the existing OCR class?
I was thinking about the expansion to other text formats. Right now we have: `["txt", "csv", "xlsx", "docx"]`. I'd like to expand to something more comprehensive: `[".docx", ".txt", ".rtf", ".odt", ".doc", ".pages", ".html", ".htm", ".xml", ".md", ".markdown", ".json", ".yaml", ".yml", ".csv", ".tsv", ".rst", ".asciidoc", ".adoc", ".epub", ".mobi", ".tex", ".py", ".js", ".java", ".sql", ".ini", ".conf", ".config", ".log", ".env", ".xlsx"]`. I haven't done the work yet to understand what each will take, and thought a new module might make sense for the future of the feature?
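One way to frame the expansion being discussed is a small extension-to-handler classifier; the groupings and handler names below are hypothetical, meant only to illustrate why a dedicated module might (or might not) be warranted.

```python
# Illustrative sketch of extension-based routing; the category names are
# assumptions, not part of the existing codebase.
from pathlib import Path

PLAIN_TEXT = {".txt", ".md", ".markdown", ".rst", ".log", ".ini", ".conf",
              ".py", ".js", ".java", ".sql", ".json", ".yaml", ".yml"}
TABULAR = {".csv", ".tsv", ".xlsx"}
RICH_DOC = {".docx", ".doc", ".odt", ".rtf", ".html", ".htm", ".xml", ".epub"}


def classify(path: str) -> str:
    """Map a file extension to a hypothetical processing category."""
    ext = Path(path).suffix.lower()
    if ext in PLAIN_TEXT:
        return "plain_text"
    if ext in TABULAR:
        return "tabular"
    if ext in RICH_DOC:
        return "rich_document"
    return "unsupported"
```

Plain-text formats need little more than decoding, while rich formats (`.docx`, `.epub`, `.pages`) each pull in their own parsing dependencies, which is the crux of the "new module vs. nested in OCR" question.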
I'd rather keep it all abstracted behind the OCR class. I feel this is a feature of OCR, not a new top-level class.
What's the latest status on this @HatmanStack - are you still tinkering with it or do you feel it's solid and ready to merge? Maybe add some notes on your (new feature + regression) testing to provide confidence. Tx!
Going to take time to properly test and feel confident about not losing context with the hybrid text extraction. Would also like to add some logic to filter non-pertinent images, logos, etc. Maybe by the weekend.
This change introduces a new, efficient processing path for text-based PDFs within the document processing pipeline.

The `text_extraction` service now generates a structured manifest of content blocks (text and images) from text-based PDFs. For image blocks, it performs OCR and includes the OCR'd text, raw image data, and S3 destination key in the manifest.

The `ocr` service has been updated to call the `text_extraction` service for text-based PDFs. It then processes the returned manifest, saving images to S3 and compiling the final text data. This new workflow improves the handling of text-native PDFs with embedded images, ensuring that all content is accurately extracted and processed.
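The manifest of content blocks described above might be shaped roughly like this. The field names are assumptions for illustration, not the actual schema used by the `text_extraction` service.

```python
# Sketch of a possible manifest structure; field names are hypothetical.
from dataclasses import dataclass, field


@dataclass
class ContentBlock:
    kind: str                  # "text" or "image"
    page: int
    text: str = ""             # extracted text, or OCR'd text for image blocks
    image_bytes: bytes = b""   # raw image data (image blocks only)
    s3_key: str = ""           # destination key for the saved image


@dataclass
class DocumentManifest:
    source: str
    blocks: list = field(default_factory=list)
```

Under this shape, the `ocr` service would walk `blocks` in order, upload each image block's `image_bytes` to its `s3_key`, and concatenate the `text` fields to produce the final parsed output.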
…ering logic for embedded images
@rstrahan This is ready for review. Testing showed ~80% reduction in token usage when using Bedrock for extraction with the hybrid method, and reduced costs to $0 for native text docs that were previously processed by Textract. This should deliver significant cost savings for users. Changes:
Additional Updates:
Documentation:
**Text Extraction Enhancement**

**What Changed**

Added intelligent PDF routing to avoid unnecessary OCR costs for text-native PDFs.

**New Components**

- `TextExtractionService` (`idp_common/text_extraction`)
- Enhanced `OcrService`

**Processing Flow**

**Benefits**

**Tasks**