IndoxMiner

IndoxMiner is a Python library that uses Large Language Models (LLMs) to extract structured information from unstructured sources such as text, PDFs, and images. You describe the data you want as a schema of typed fields, and the library extracts, validates, and transforms the matching values, which makes it a good fit for automating document processing workflows.

🚀 Key Features

Multi-Format Support: Extract data from text, PDFs, images, and scanned documents
Schema-Based Extraction: Define custom schemas to specify exactly what data to extract
LLM Integration: Seamless integration with OpenAI models for intelligent extraction
Validation & Type Safety: Built-in validation rules and type-safe field definitions
Flexible Output: Export to JSON, pandas DataFrames, or custom formats
Async Support: Built for scalability with asynchronous processing capabilities
OCR Integration: Multiple OCR engine options for image-based text extraction
High-Resolution Support: Enhanced processing for high-quality PDFs
Error Handling: Comprehensive error handling and validation reporting

📦 Installation

pip install indoxminer

🎯 Quick Start

Basic Text Extraction

from indoxminer import ExtractorSchema, Field, FieldType, ValidationRule, Extractor, OpenAi

# Initialize OpenAI extractor
llm_extractor = OpenAi(
    api_key="your-api-key",
    model="gpt-4o-mini"
)

# Define extraction schema
schema = ExtractorSchema(
    fields=[
        Field(
            name="product_name",
            description="Product name",
            field_type=FieldType.STRING,
            rules=ValidationRule(min_length=2)
        ),
        Field(
            name="price",
            description="Price in USD",
            field_type=FieldType.FLOAT,
            rules=ValidationRule(min_value=0)
        )
    ]
)

# Create extractor and process text
extractor = Extractor(llm=llm_extractor, schema=schema)
text = """
MacBook Pro 16-inch with M2 chip
Price: $2,399.99
In stock: Yes
"""

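# Extract and convert to DataFrame.
# `await` needs an async context: a notebook cell, or an `async def`
# function started with asyncio.run().
result = await extractor.extract(text)
df = extractor.to_dataframe(result)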
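
In a plain script, `await` cannot appear at module level, so the extraction call needs an event loop around it. The following is a minimal sketch of that wiring. `fake_extract` is a hypothetical stand-in for `extractor.extract`, used so the snippet runs without an API key; its return shape is an assumption, not the library's documented output.

```python
import asyncio


async def fake_extract(text: str) -> dict:
    """Hypothetical stand-in for `extractor.extract`; returns a fixed record."""
    await asyncio.sleep(0)  # yield to the event loop, as a real network call would
    return {"product_name": "MacBook Pro 16-inch with M2 chip", "price": 2399.99}


async def main() -> dict:
    # Inside an `async def`, `await` is legal.
    return await fake_extract("MacBook Pro 16-inch with M2 chip\nPrice: $2,399.99")


if __name__ == "__main__":
    # asyncio.run creates the event loop, runs main() to completion, and closes the loop.
    print(asyncio.run(main()))
```

In a real script, the body of `main` would call `extractor.extract(text)` instead of the stand-in.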

PDF Processing

from indoxminer import (
    DocumentProcessor,
    ProcessingConfig,
    Extractor,
    ExtractorSchema,
    Field,
    FieldType,
)

# Initialize processor with custom config
processor = DocumentProcessor(
    files=["invoice.pdf"],
    config=ProcessingConfig(
        hi_res_pdf=True,
        chunk_size=1000
    )
)

# Process document
documents = processor.process()

# Extract structured data
schema = ExtractorSchema(
    fields=[
        Field(
            name="bill_to",
            description="Billing address",
            field_type=FieldType.STRING
        ),
        Field(
            name="invoice_date",
            description="Invoice date",
            field_type=FieldType.DATE
        ),
        Field(
            name="total_amount",
            description="Total amount in USD",
            field_type=FieldType.FLOAT
        )
    ]
)

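# Build an extractor for the invoice schema
# (llm_extractor is the OpenAi instance from Basic Text Extraction)
extractor = Extractor(llm=llm_extractor, schema=schema)
results = await extractor.extract(documents)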
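
Once invoice fields are extracted, a common next step is writing them out as JSON. The sketch below assumes that a `DATE` field arrives as a Python `datetime.date`, which the README does not confirm, and the sample record is made up for illustration. It shows how `json.dumps` can be told to serialize dates through its `default` hook.

```python
import json
from datetime import date


def to_json(record: dict) -> str:
    """Serialize a record to JSON, writing dates as ISO 8601 strings."""

    def encode(value):
        # json.dumps calls this only for values it cannot serialize itself.
        if isinstance(value, date):
            return value.isoformat()
        raise TypeError(f"Cannot serialize {type(value).__name__}")

    return json.dumps(record, default=encode, sort_keys=True)


# Made-up sample record shaped like the invoice schema above.
invoice = {
    "bill_to": "Example Corp, 1 Main St",
    "invoice_date": date(2024, 1, 31),
    "total_amount": 1250.0,
}

if __name__ == "__main__":
    print(to_json(invoice))
```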

Image Processing with OCR

from indoxminer import DocumentProcessor, ProcessingConfig

# Configure OCR-enabled processor
config = ProcessingConfig(
    ocr_enabled=True,
    ocr_engine="easyocr",  # or "tesseract", "paddle"
    language="en"
)

processor = DocumentProcessor(
    files=["receipt.jpg"],
    config=config
)

# Process image and extract text
documents = processor.process()

🔧 Core Components

ExtractorSchema

Defines the structure of data to be extracted:

Field definitions
Validation rules
Output format specifications

schema = ExtractorSchema(
    fields=[...],
    output_format="json"
)

Field Types

Supported field types:

STRING: Text data
INTEGER: Whole numbers
FLOAT: Decimal numbers
DATE: Date values
BOOLEAN: True/False values
LIST: Arrays of values
DICT: Nested objects
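
To make the field types concrete, here is a small sketch of how raw strings from a document might be coerced into `FLOAT`, `BOOLEAN`, and `DATE` values. These helpers are hypothetical and use only the standard library; they are not IndoxMiner's own conversion code, and they assume US-style number formatting and ISO dates.

```python
import re
from datetime import date, datetime


def to_float(raw: str) -> float:
    """Drop currency symbols and thousands separators, then parse."""
    cleaned = re.sub(r"[^0-9.\-]", "", raw)
    return float(cleaned)


def to_bool(raw: str) -> bool:
    """Map common yes/no spellings to a boolean."""
    value = raw.strip().lower()
    if value in {"yes", "true", "1"}:
        return True
    if value in {"no", "false", "0"}:
        return False
    raise ValueError(f"Not a recognizable boolean: {raw!r}")


def to_date(raw: str) -> date:
    """Parse an ISO-style date such as 2024-01-31."""
    return datetime.strptime(raw.strip(), "%Y-%m-%d").date()
```

For the Quick Start text, `to_float("$2,399.99")` yields the price as a number and `to_bool("Yes")` covers the "In stock" line.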

Validation Rules

Available validation options:

min_length/max_length: String length constraints
min_value/max_value: Numeric bounds
pattern: Regex patterns
required: Required fields
custom: Custom validation functions
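
The options above can be pictured as a set of independent checks applied to each extracted value. The sketch below is an illustration under that assumption: `Rule` and `validate` are hypothetical names that mirror the option list, not IndoxMiner's `ValidationRule` implementation, and the error messages are invented.

```python
import re
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class Rule:
    """Hypothetical rule container mirroring the options listed above."""
    min_length: Optional[int] = None
    max_length: Optional[int] = None
    min_value: Optional[float] = None
    max_value: Optional[float] = None
    pattern: Optional[str] = None
    required: bool = False
    custom: Optional[Callable[[Any], bool]] = None


def validate(value: Any, rule: Rule) -> List[str]:
    """Return a list of problems; an empty list means the value passed."""
    if value is None:
        return ["value is required"] if rule.required else []
    errors = []
    if isinstance(value, str):
        if rule.min_length is not None and len(value) < rule.min_length:
            errors.append(f"shorter than {rule.min_length} characters")
        if rule.max_length is not None and len(value) > rule.max_length:
            errors.append(f"longer than {rule.max_length} characters")
        if rule.pattern is not None and re.fullmatch(rule.pattern, value) is None:
            errors.append("does not match pattern")
    # bool is a subclass of int, so exclude it from numeric checks.
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        if rule.min_value is not None and value < rule.min_value:
            errors.append(f"below minimum {rule.min_value}")
        if rule.max_value is not None and value > rule.max_value:
            errors.append(f"above maximum {rule.max_value}")
    if rule.custom is not None and not rule.custom(value):
        errors.append("failed custom check")
    return errors
```

Returning a list rather than raising on the first failure lets one pass report every problem with a field.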

⚙️ Configuration Options

ProcessingConfig

config = ProcessingConfig(
    hi_res_pdf=True,          # High-resolution PDF processing
    ocr_enabled=True,         # Enable OCR
    ocr_engine="tesseract",   # OCR engine selection
    chunk_size=1000,          # Text chunk size
    language="en",            # Processing language
    max_threads=4             # Parallel processing threads
)
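
As a rough picture of what `chunk_size` and `max_threads` control, the sketch below splits text into fixed-size pieces and processes them on a thread pool. It assumes `chunk_size` counts characters and that chunks are processed independently; the README states neither, and both function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def split_into_chunks(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def process_chunks(chunks: List[str], worker: Callable[[str], object],
                   max_threads: int) -> list:
    """Run worker on every chunk using up to max_threads threads."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # pool.map returns results in the same order as the input chunks.
        return list(pool.map(worker, chunks))
```

For example, `process_chunks(split_into_chunks(text, 1000), len, 4)` returns the length of each chunk in document order.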

🔍 Error Handling

IndoxMiner provides detailed error reporting:

results = await extractor.extract(documents)

if not results.is_valid:
    for chunk_idx, errors in results.validation_errors.items():
        print(f"Errors in chunk {chunk_idx}:")
        for error in errors:
            print(f"- {error.field}: {error.message}")

# Access valid results
valid_data = results.get_valid_results()
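
When many chunks fail validation, a per-field summary is often easier to read than the full listing. The sketch below assumes `validation_errors` maps a chunk index to a list of objects with `field` and `message` attributes, as the loop above suggests. `FieldError` is a hypothetical stand-in for the library's error type.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FieldError:
    """Hypothetical stand-in for one validation error."""
    field: str
    message: str


def count_errors_by_field(validation_errors: Dict[int, List[FieldError]]) -> Counter:
    """Count how many errors each field produced across all chunks."""
    counts = Counter()
    for errors in validation_errors.values():
        for error in errors:
            counts[error.field] += 1
    return counts
```

Calling `count_errors_by_field(results.validation_errors).most_common()` would then list the most error-prone fields first, assuming the structure described above.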

🤝 Contributing

We welcome contributions! To contribute:

Fork the repository
Create a feature branch
Commit your changes
Push to the branch
Open a Pull Request

Please read our Contributing Guidelines (CONTRIBUTING.md) for more details.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🆘 Support

Documentation: Full documentation
Issues: GitHub Issues
Discussions: GitHub Discussions
