A comprehensive Rust toolkit for extracting and processing documentation from multiple file formats into a universal JSON structure, optimized for vector stores and RAG (Retrieval-Augmented Generation) systems.
Current Version: 0.3.1
Status: Production Ready
Python Bindings: Fully Functional
Documentation: Complete
- Universal JSON Output: Consistent format across all document types
- Multiple Format Support: PDF, TXT, JSON, CSV, DOCX
- Python Bindings: Full PyO3 integration with native performance
- Intelligent Text Processing: Smart chunking, cleaning, and metadata extraction
- Modular Architecture: Each document type has its own specialized processor
- Vector Store Ready: Optimized output for embedding and indexing
- CLI Tools: Both a universal processor and format-specific binaries
- Rich Metadata: Comprehensive document and chunk-level metadata
- Language Detection: Automatic language detection capabilities
- Performance Optimized: Fast processing with detailed timing information
- Rust 1.70+ (for compilation)
- Cargo (comes with Rust)
git clone https://github.com/WillIsback/doc_loader.git
cd doc_loader
cargo build --release
After building, you'll have access to these CLI tools:
- doc_loader - Universal document processor
- pdf_processor - PDF-specific processor
- txt_processor - Plain text processor
- json_processor - JSON document processor
- csv_processor - CSV file processor
- docx_processor - DOCX document processor
Process any supported document type with the main binary:
# Basic usage
./target/release/doc_loader --input document.pdf
# With custom options
./target/release/doc_loader \
--input document.pdf \
--output result.json \
--chunk-size 1500 \
--chunk-overlap 150 \
--detect-language \
--pretty
Use specialized processors for specific formats:
# Process a PDF
./target/release/pdf_processor --input report.pdf --pretty
# Process a CSV with analysis
./target/release/csv_processor --input data.csv --output analysis.json
# Process a JSON document
./target/release/json_processor --input config.json --detect-language
All processors support these common options:
- --input <FILE> - Input file path (required)
- --output <FILE> - Output JSON file (optional, defaults to stdout)
- --chunk-size <SIZE> - Maximum chunk size in characters (default: 1000)
- --chunk-overlap <SIZE> - Overlap between chunks (default: 100)
- --no-cleaning - Disable text cleaning
- --detect-language - Enable language detection
- --pretty - Pretty print JSON output
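When driving the CLI from a script, the same options can be passed programmatically. A minimal Python sketch, assuming the release binary built above and illustrative file names:
import subprocess
# Invoke the universal CLI with explicit chunking options (file names are illustrative)
subprocess.run(
    [
        "./target/release/doc_loader",
        "--input", "document.pdf",
        "--output", "result.json",
        "--chunk-size", "1500",
        "--chunk-overlap", "150",
        "--detect-language",
    ],
    check=True,  # raise CalledProcessError if the processor exits non-zero
)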
All processors generate a standardized JSON structure:
{
"document_metadata": {
"filename": "document.pdf",
"filepath": "/path/to/document.pdf",
"document_type": "PDF",
"file_size": 1024000,
"created_at": "2025-01-01T12:00:00Z",
"modified_at": "2025-01-01T12:00:00Z",
"title": "Document Title",
"author": "Author Name",
"format_metadata": {
// Format-specific metadata
}
},
"chunks": [
{
"id": "pdf_chunk_0",
"content": "Extracted text content...",
"chunk_index": 0,
"position": {
"page": 1,
"line": 10,
"start_offset": 0,
"end_offset": 1000
},
"metadata": {
"size": 1000,
"language": "en",
"confidence": 0.95,
"format_specific": {
// Chunk-specific metadata
}
}
}
],
"processing_info": {
"processor": "PdfProcessor",
"processor_version": "1.0.0",
"processed_at": "2025-01-01T12:00:00Z",
"processing_time_ms": 150,
"total_chunks": 5,
"total_content_size": 5000,
"processing_params": {
"max_chunk_size": 1000,
"chunk_overlap": 100,
"text_cleaning": true,
"language_detection": true
}
}
}
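Because every processor emits this same structure, downstream code can stay format-agnostic. Below is a minimal Python sketch that collects chunk texts for an embedding or indexing step, assuming the output above was saved as result.json via --output:
import json
# Load the universal JSON produced by any of the processors
with open("result.json", encoding="utf-8") as f:
    doc = json.load(f)
# Gather chunk ids and contents for an embedding / vector-store step
ids = [chunk["id"] for chunk in doc["chunks"]]
texts = [chunk["content"] for chunk in doc["chunks"]]
print(doc["document_metadata"]["filename"], doc["processing_info"]["total_chunks"])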
The project follows a modular architecture:
src/
├── lib.rs              # Main library interface
├── main.rs             # Universal CLI
├── error.rs            # Error handling
├── core/               # Core data structures
│   └── mod.rs          # Universal output format
├── utils/              # Utility functions
│   └── mod.rs          # Text processing utilities
├── processors/         # Document processors
│   ├── mod.rs          # Common processor traits
│   ├── pdf.rs          # PDF processor
│   ├── txt.rs          # Text processor
│   ├── json.rs         # JSON processor
│   ├── csv.rs          # CSV processor
│   └── docx.rs         # DOCX processor
└── bin/                # Individual CLI binaries
    ├── pdf_processor.rs
    ├── txt_processor.rs
    ├── json_processor.rs
    ├── csv_processor.rs
    └── docx_processor.rs
Test the functionality with the provided sample files:
# Test text processing
./target/debug/doc_loader --input test_sample.txt --pretty
# Test JSON processing
./target/debug/json_processor --input test_sample.json --pretty
# Test CSV processing
./target/debug/csv_processor --input test_sample.csv --pretty
Each processor provides format-specific capabilities:
PDF:
- Text extraction with lopdf
- Page-based chunking
- Metadata extraction (title, author, creation date)
- Position tracking (page, line, offset)
CSV:
- Header detection and analysis
- Column statistics (data types, fill rates, unique values)
- Row-by-row or batch processing
- Data completeness analysis
JSON:
- Hierarchical structure analysis
- Key extraction and statistics
- Nested object flattening
- Schema inference
DOCX:
- Document structure parsing
- Style and formatting preservation
- Section and paragraph extraction
- Metadata extraction
TXT:
- Encoding detection
- Line and paragraph preservation
- Language detection
- Character and word counting
Use doc_loader as a library in your Rust projects:
use doc_loader::{UniversalProcessor, ProcessingParams};
use std::path::Path;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let processor = UniversalProcessor::new();
let params = ProcessingParams::default()
.with_chunk_size(1500)
.with_language_detection(true);
let result = processor.process_file(
Path::new("document.pdf"),
Some(params)
)?;
println!("Extracted {} chunks", result.chunks.len());
Ok(())
}
- Fast Processing: Optimized for large documents
- Memory Efficient: Streaming processing for large files
- Detailed Metrics: Processing time and statistics
- Concurrent Support: Thread-safe processors
- Enhanced PDF text extraction (pdfium integration)
- Complete DOCX XML parsing
- Unit test coverage
- Performance benchmarks
- Additional formats (XLSX, PPTX, HTML, Markdown)
- Advanced language detection
- Web interface/API
- Vector store integrations
- OCR support for scanned documents
- Parallel processing optimizations
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
[Add your license information here]
Report issues on the project's issue tracker. Include:
- File format and size
- Command used
- Error messages
- Expected vs actual behavior
Doc Loader - Making document processing simple, fast, and universal!
Doc Loader provides fully functional Python bindings through PyO3, offering the same performance as the native Rust library with a clean Python API.
# Via PyPI (recommended)
pip install extracteur-docs-rs
# Or build from source
# Create virtual environment
python3 -m venv venv
source venv/bin/activate
# Install maturin build tool
pip install maturin
# Build and install Python bindings (Python 3.9+ supported)
venv/bin/maturin develop --features python --release
import extracteur_docs_rs as doc_loader
# Quick start - process any supported file format
result = doc_loader.process_file("document.pdf", chunk_size=500)
print(f"Chunks: {result.chunk_count()}")
print(f"Words: {result.total_word_count()}")
print(f"Supported formats: {doc_loader.supported_extensions()}")
# Advanced usage with custom parameters
processor = doc_loader.PyUniversalProcessor()
params = doc_loader.PyProcessingParams(
chunk_size=400,
overlap=60,
clean_text=True,
extract_metadata=True
)
result = processor.process_file("document.txt", params)
# Process text content directly
text_result = processor.process_text_content("Your text here...", params)
# Export to JSON
json_output = result.to_json()
- RAG/Embedding Pipeline: Direct integration with sentence-transformers (see the sketch after this list)
- Data Analysis: Export to pandas DataFrames
- REST API: Flask/FastAPI endpoints
- Batch Processing: Process directories of documents
- Jupyter Notebooks: Interactive document analysis
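As an illustration of the RAG/embedding use case, the sketch below pulls chunk contents out of the JSON export and embeds them with sentence-transformers; that package is an external dependency, and the model name is illustrative:
import json
import extracteur_docs_rs as doc_loader
from sentence_transformers import SentenceTransformer  # external dependency
# Process a document and read its chunks back from the universal JSON export
result = doc_loader.process_file("document.pdf", chunk_size=500)
chunks = json.loads(result.to_json())["chunks"]
texts = [chunk["content"] for chunk in chunks]
# Embed each chunk for a vector store (model choice is illustrative)
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(texts)
print(embeddings.shape)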
The Python bindings are fully tested and functional with:
- All file formats supported (PDF, TXT, JSON, CSV, DOCX)
- Complete API coverage matching Rust functionality
- Proper error handling with Python exceptions (demonstrated in the sketch below)
- Full parameter customization
- Comprehensive documentation and examples
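Errors surface as ordinary Python exceptions. The specific exception classes are not documented here, so this minimal sketch catches broadly:
import extracteur_docs_rs as doc_loader
try:
    # A missing or unsupported file should raise rather than fail silently
    doc_loader.process_file("missing_or_unsupported.xyz")
except Exception as exc:
    print(f"Processing failed: {exc}")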
Run the demo: venv/bin/python python_demo.py
For complete Python documentation, see docs/python_usage.md.