Implement Multi-Format Document Content Extraction Framework #233

santosh-kumar-g-cloudambassadors · 2025-01-30T18:02:03Z

This PR introduces comprehensive multi-format content extraction capabilities, enhances error handling, and adds robust test coverage. The changes enable seamless extraction of text from PDFs, Microsoft Office documents (DOC/DOCX, XLS/XLSX/XLSM, PPT/PPTX), and plain text files, supporting both local and web-hosted sources.

Fixes #1
Fixes #138

Key Enhancements

1. Enhanced PDFExtractor

✨ Web & Local PDF Support: Extract text from both URL-hosted and local PDFs.
📄 Page-Wise Extraction: Improved parsing to handle large documents efficiently.
🔍 Text Normalization: Handle special characters and whitespace for cleaner output.
🛠️ Error Handling: Graceful failure and logging for invalid URLs, corrupted files, and extraction errors.
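
For reviewers, the text-normalization step could look roughly like the sketch below. It is illustrative only and not the code in this PR: `normalize_text` is a hypothetical helper name, and the exact rules (NFKC folding, dropping non-printing characters, collapsing whitespace) are assumptions about what "cleaner output" means here.

```python
import re
import unicodedata


def normalize_text(raw: str) -> str:
    """Hypothetical helper: clean up text extracted from a PDF page."""
    # Fold compatibility characters (ligatures, non-breaking spaces) to plain forms.
    text = unicodedata.normalize("NFKC", raw)
    # Drop non-printing characters (Unicode category C*), keeping newlines and tabs.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or not unicodedata.category(ch).startswith("C")
    )
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Collapse runs of blank lines into a single blank line.
    text = re.sub(r"\n\s*\n+", "\n\n", text)
    return text.strip()
```

Keeping normalization in one pure function would make it easy to unit test without touching any PDF library.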

2. Microsoft Office Suite Support

📄 Word Documents
- .doc and .docx support
- Paragraph-level extraction
- Format preservation options
📎 Excel Workbooks
- .xls, .xlsx, .xlsm handling
- Multi-sheet support
🎞️ PowerPoint Presentations
- .ppt and .pptx compatibility
- Slide-wise content extraction
📝 Text File Processing
- Auto-detect file encodings
- Local and URL-based
- UTF-8 and extended charset support
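
As a rough illustration of encoding auto-detection, the sketch below checks for a byte-order mark and then tries a short list of candidate encodings. `decode_with_fallback` and the candidate order are assumptions made for this example, not necessarily what TextExtractor does; a detection library could be swapped in instead.

```python
import codecs
from typing import Tuple


def decode_with_fallback(data: bytes) -> Tuple[str, str]:
    """Hypothetical helper: return (decoded text, encoding that worked)."""
    # A byte-order mark is the most reliable signal, so check it first.
    if data.startswith(codecs.BOM_UTF8):
        return data.decode("utf-8-sig"), "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return data.decode("utf-16"), "utf-16"
    # Try strict UTF-8 next, then common single-byte encodings.
    # latin-1 maps every byte to a character, so the loop always returns.
    for encoding in ("utf-8", "cp1252", "latin-1"):
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise AssertionError("unreachable: latin-1 decodes any byte sequence")
```

UTF-16 is only accepted when a byte-order mark is present, because arbitrary even-length byte strings often decode as UTF-16 without error and produce garbage.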

3. Cloud-Hosted File Integration

☁️ Documents hosted in the cloud (e.g., Google Drive, GCP Cloud Storage buckets, AWS S3, etc.) can be fed directly into podcast generation pipelines.

4. Unified ContentExtractor

🤖 Smart Source Detection: Auto-identify file types (PDF, Office, Text) and sources (local vs. URL).
🔄 Modular Integration: Leverage PDFExtractor, OfficeExtractor, and TextExtractor for multi-format support.
📊 Main Method Updates: Tested with diverse inputs (e.g., web PDFs, local XLSX files, URL-hosted DOCX).
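
To make the source-detection idea concrete, here is a minimal sketch that classifies an input as local or URL and maps its extension to an extractor family. `classify_source` and `EXTENSION_MAP` are hypothetical names for this example, and detection by file extension alone is an assumption; the actual ContentExtractor may also inspect content types or other signals.

```python
from pathlib import PurePosixPath
from typing import Tuple
from urllib.parse import urlparse

# Hypothetical mapping from file extension to extractor family.
EXTENSION_MAP = {
    ".pdf": "pdf",
    ".doc": "office", ".docx": "office",
    ".xls": "office", ".xlsx": "office", ".xlsm": "office",
    ".ppt": "office", ".pptx": "office",
    ".txt": "text",
}


def classify_source(source: str) -> Tuple[str, str]:
    """Return (location, kind), e.g. ("url", "pdf") or ("local", "office")."""
    parsed = urlparse(source)
    is_url = parsed.scheme in ("http", "https") and bool(parsed.netloc)
    # For URLs, use only the path so query strings do not hide the extension.
    path = parsed.path if is_url else source
    suffix = PurePosixPath(path).suffix.lower()
    location = "url" if is_url else "local"
    return location, EXTENSION_MAP.get(suffix, "unknown")
```

A caller could then dispatch on the second element to PDFExtractor, OfficeExtractor, or TextExtractor, and reject "unknown" early with a clear error.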

5. Comprehensive Test Suite

✅ Expanded Coverage: Unit tests for edge cases (large files, malformed URLs, encoding issues).
🧪 Mocked Web Requests: Safely simulate web-hosted file interactions.
🛡️ Error Scenario Tests: Validate handling of timeouts, invalid paths, and unsupported formats.
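
The mocked-request approach can be sketched as follows. `fetch_bytes` is a hypothetical stand-in for an extractor's download step, built on the standard library's `urllib` as an assumption; the tests in this PR may target a different HTTP client and different function names.

```python
import io
import urllib.error
import urllib.request
from unittest import mock


def fetch_bytes(url: str, timeout: float = 10.0) -> bytes:
    """Hypothetical download step used by an extractor."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, TimeoutError) as exc:
        raise RuntimeError(f"Could not download {url}: {exc}") from exc


def test_fetch_bytes_returns_body():
    # Replace urlopen so no real network call is made.
    fake_response = io.BytesIO(b"%PDF-1.7 fake body")
    with mock.patch("urllib.request.urlopen", return_value=fake_response):
        assert fetch_bytes("https://example.com/a.pdf") == b"%PDF-1.7 fake body"


def test_fetch_bytes_wraps_timeout():
    # Simulate a timeout and check it surfaces as a single error type.
    with mock.patch("urllib.request.urlopen", side_effect=TimeoutError("timed out")):
        try:
            fetch_bytes("https://example.com/slow.pdf")
        except RuntimeError as exc:
            assert "slow.pdf" in str(exc)
        else:
            raise AssertionError("expected RuntimeError")
```

The test functions use plain `assert` statements, so they can be collected by pytest or called directly.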

santosh-kumar-g-cloudambassadors · 2025-01-31T06:57:56Z

Hi @souzatharsis
Could you please review this PR and run the checks?
thanks :)

souzatharsis · 2025-02-01T20:06:48Z

Hi Santosh, many thanks for your PR.

I think it enables useful features for users.
However, it adds complexity to the implementation.
Instead of implementing a parser per document type, I'd favor using a solution such as Docling, which provides (i) a unified way to parse multiple document types, (ii) an implementation widely supported by the open source community, and (iii) advanced OCR capabilities.

I'd recommend taking a look at Docling, which could potentially deliver the same or a better outcome with considerably fewer lines of code.

What do you think?

santosh-kumar-g-cloudambassadors · 2025-02-02T02:46:35Z

Sure @souzatharsis
I'll have a look and see if I can contribute using docling 👍🏽

santosh-kumar-g-cloudambassadors added 6 commits January 30, 2025 22:50

Enhance PDF Extractor with URL and Robust Text Extraction Support

463fde0

Add TextExtractor for robust text file content extraction

28a03d9

Add OfficeExtractor for extracting text from Microsoft Office documents (.doc, .docx, .xls, .xlsx, xlsm, .ppt and .pptx) online and local files

d9410e9

Enhance ContentExtractor with multi-format content extraction support

e4648ef

Add comprehensive test suite for content extraction modules

0ff31d2

Update dev-requirements with additional document and data processing libraries

6a93327

Merge branch 'souzatharsis:main' into multidoc-support

153b5b1
