Implement Multi-Format Document Content Extraction Framework #233

Open
wants to merge 7 commits into
base: main

santosh-kumar-g-cloudambassadors
Contributor

@santosh-kumar-g-cloudambassadors santosh-kumar-g-cloudambassadors commented Jan 30, 2025

This PR introduces comprehensive multi-format content extraction capabilities, enhances error handling, and adds robust test coverage. The changes enable seamless extraction of text from PDFs, Microsoft Office documents (DOC/DOCX, XLS/XLSX/XLSM, PPT/PPTX), and plain text files, supporting both local and web-hosted sources.

Fixes #1
Fixes #138

Key Enhancements

1. Enhanced PDFExtractor

  • Web & Local PDF Support: Extract text from both URL-hosted and local PDFs.
  • 📄 Page-Wise Extraction: Improved parsing to handle large documents efficiently.
  • 🔍 Text Normalization: Handle special characters and whitespace for cleaner output.
  • 🛠️ Error Handling: Graceful failure and logging for invalid URLs, corrupted files, and extraction errors.
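The text normalization step could look roughly like the sketch below. It is a standard-library illustration only: `normalize_text` and `join_pages` are hypothetical names, not functions taken from this PR, and the page strings are assumed to come from whichever PDF library does the actual extraction.

```python
import re
import unicodedata


def normalize_text(raw: str) -> str:
    """Clean one page of extracted text (hypothetical helper, not from the PR)."""
    # NFKC folds compatibility characters: the ligature "ﬁ" becomes "fi",
    # and a non-breaking space becomes a regular space.
    text = unicodedata.normalize("NFKC", raw)
    # Drop control characters (form feeds, carriage returns) except newline and tab.
    text = "".join(
        ch for ch in text if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Collapse runs of spaces and tabs, then trim each line.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    # Reduce three or more consecutive newlines to a single blank line.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()


def join_pages(pages: list[str]) -> str:
    """Normalize each page on its own and join non-empty pages with a blank line."""
    cleaned = (normalize_text(page) for page in pages)
    return "\n\n".join(page for page in cleaned if page)
```

Normalizing page by page keeps memory use proportional to a single page, which is one way to read the "page-wise" bullet above.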

2. Microsoft Office Suite Support

  • 📄 Word Documents

    • .doc and .docx support
    • Paragraph-level extraction
    • Format preservation options
  • 📎 Excel Workbooks

    • .xls, .xlsx, .xlsm handling
    • Multi-sheet support
  • 🎞️ PowerPoint Presentations

    • .ppt and .pptx compatibility
    • Slide-wise content extraction
  • 📝 Text File Processing

    • Auto-detect file encodings
    • Local and URL-based sources
    • UTF-8 and extended charset support
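One common way to auto-detect a text file's encoding with only the standard library is to check for a byte order mark and then try a short list of candidate encodings. The sketch below assumes that approach; the PR may use a different detection method, and `decode_bytes` and `FALLBACK_ENCODINGS` are hypothetical names.

```python
import codecs

# Candidate encodings tried in order. This list is an assumption for illustration.
FALLBACK_ENCODINGS = ("utf-8", "cp1252")

# Byte order marks paired with the codec that consumes them.
# UTF-32 BOMs are not handled in this sketch.
BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def decode_bytes(data: bytes) -> tuple[str, str]:
    """Return (text, encoding_used) for raw file bytes."""
    # A byte order mark is the most reliable signal, so check it first.
    for bom, name in BOMS:
        if data.startswith(bom):
            return data.decode(name), name
    # Otherwise take the first candidate that decodes without errors.
    for name in FALLBACK_ENCODINGS:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this last resort cannot fail.
    return data.decode("latin-1"), "latin-1"
```

Because the function takes bytes, the same code path works for a local file read in binary mode and for the body of an HTTP response.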

3. Cloud-Hosted File Integration

  • ☁️ Directly integrate cloud-hosted documents (e.g., Google Drive, GCP Cloud Storage buckets, AWS S3, etc.) into podcast generation pipelines.

4. Unified ContentExtractor

  • 🤖 Smart Source Detection: Auto-identify file types (PDF, Office, Text) and sources (local vs. URL).
  • 🔄 Modular Integration: Leverage PDFExtractor, OfficeExtractor, and TextExtractor for multi-format support.
  • 📊 Main Method Updates: Tested with diverse inputs (e.g., web PDFs, local XLSX files, URL-hosted DOCX).
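Source detection of this kind can be sketched with the standard library alone. In the example below, `detect_source` and the `EXTENSION_KINDS` table are hypothetical and only illustrate the idea of classifying by URL scheme and file extension; the actual ContentExtractor logic in this PR may differ, for example by also inspecting HTTP content types.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extension-to-extractor mapping, assumed from the formats listed in this PR.
EXTENSION_KINDS = {
    ".pdf": "pdf",
    ".doc": "office", ".docx": "office",
    ".xls": "office", ".xlsx": "office", ".xlsm": "office",
    ".ppt": "office", ".pptx": "office",
    ".txt": "text",
}


def detect_source(source: str) -> tuple[str, str]:
    """Return (location, kind): location is "url" or "local"."""
    parsed = urlparse(source)
    # Only http(s) with a host counts as a URL, so "C:\\file.txt" stays local.
    is_url = parsed.scheme in ("http", "https") and bool(parsed.netloc)
    # For URLs, use the path component so query strings do not hide the extension.
    path = parsed.path if is_url else source
    suffix = PurePosixPath(path.replace("\\", "/")).suffix.lower()
    return ("url" if is_url else "local"), EXTENSION_KINDS.get(suffix, "unknown")
```

A dispatcher could then route `"pdf"`, `"office"`, and `"text"` to PDFExtractor, OfficeExtractor, and TextExtractor respectively, and reject `"unknown"` with an unsupported-format error.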

5. Comprehensive Test Suite

  • Expanded Coverage: Unit tests for edge cases (large files, malformed URLs, encoding issues).
  • 🧪 Mocked Web Requests: Safely simulate web-hosted file interactions.
  • 🛡️ Error Scenario Tests: Validate handling of timeouts, invalid paths, and unsupported formats.
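As an illustration of the mocking approach, the sketch below patches `urllib.request.urlopen` with `unittest.mock` so no network traffic occurs. `fetch_text` is a hypothetical stand-in for the extractor's download step, and the PR's own tests may mock a different HTTP client.

```python
import urllib.error
import urllib.request
from unittest import mock


def fetch_text(url: str, timeout: float = 10.0) -> str:
    """Hypothetical download helper, used here only to demonstrate mocking."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except (urllib.error.URLError, TimeoutError) as exc:
        raise RuntimeError(f"Could not fetch {url}") from exc


def test_fetch_success():
    # MagicMock supports the context manager protocol used by "with".
    fake_response = mock.MagicMock()
    fake_response.__enter__.return_value.read.return_value = "café".encode("utf-8")
    with mock.patch("urllib.request.urlopen", return_value=fake_response) as fake_open:
        assert fetch_text("https://example.com/notes.txt") == "café"
    fake_open.assert_called_once_with("https://example.com/notes.txt", timeout=10.0)


def test_fetch_timeout():
    # side_effect makes the patched call raise instead of returning a response.
    with mock.patch("urllib.request.urlopen", side_effect=TimeoutError("timed out")):
        try:
            fetch_text("https://example.com/slow.txt")
        except RuntimeError as exc:
            assert isinstance(exc.__cause__, TimeoutError)
        else:
            raise AssertionError("expected RuntimeError")
```

Both functions follow the `test_*` naming convention, so a runner such as pytest would collect them without extra configuration.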

@santosh-kumar-g-cloudambassadors
Contributor Author

Hi @souzatharsis
Could you please review this PR and run the checks?
Thanks :)

@souzatharsis
Owner

Hi Santosh, many thanks for your PR.

I think it enables useful features for users.
However, it adds complexity to the implementation.
Instead of implementing parsers per document type, I'd favor using solutions such as Docling, which (i) provides a unified way to parse multi-type documents with (ii) a widely supported implementation by the open source community and (iii) advanced OCR capabilities.

I'd recommend taking a look at Docling, which could potentially deliver the same or better outcome with considerably fewer lines of code.

What do you think?

@santosh-kumar-g-cloudambassadors
Contributor Author

Sure @souzatharsis
I'll have a look and see if I can contribute using Docling 👍🏽

Successfully merging this pull request may close these issues.

  • Replace content_parser with docling to support multiple formats out-of-the-box
  • Support for input text file
2 participants