IndoxMiner is a powerful Python library that leverages Large Language Models (LLMs) to extract structured information from unstructured data sources including text, PDFs, and images. Using a flexible schema-based approach, it enables precise data extraction, validation, and transformation, making it ideal for automating document processing workflows.
- Multi-Format Support: Extract data from text, PDFs, images, and scanned documents
- Schema-Based Extraction: Define custom schemas to specify exactly what data to extract
- LLM Integration: Seamless integration with OpenAI models for intelligent extraction
- Validation & Type Safety: Built-in validation rules and type-safe field definitions
- Flexible Output: Export to JSON, pandas DataFrames, or custom formats
- Async Support: Built for scalability with asynchronous processing capabilities
- OCR Integration: Multiple OCR engine options for image-based text extraction
- High-Resolution Support: Enhanced processing for high-quality PDFs
- Error Handling: Comprehensive error handling and validation reporting
pip install indoxminer
from indoxminer import ExtractorSchema, Field, FieldType, ValidationRule, Extractor, OpenAi
# Initialize OpenAI extractor
llm_extractor = OpenAi(
    api_key="your-api-key",
    model="gpt-4o-mini"
)
# Define extraction schema
schema = ExtractorSchema(
    fields=[
        Field(
            name="product_name",
            description="Product name",
            field_type=FieldType.STRING,
            rules=ValidationRule(min_length=2)
        ),
        Field(
            name="price",
            description="Price in USD",
            field_type=FieldType.FLOAT,
            rules=ValidationRule(min_value=0)
        )
    ]
)
# Create extractor and process text
extractor = Extractor(llm=llm_extractor, schema=schema)
text = """
MacBook Pro 16-inch with M2 chip
Price: $2,399.99
In stock: Yes
"""
# Extract and convert to DataFrame
# (extract() is a coroutine: await it inside an async function or a notebook cell)
result = await extractor.extract(text)
df = extractor.to_dataframe(result)
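The snippet above awaits `extractor.extract` directly, which works in a notebook but not at the top level of a plain script. The following is a minimal sketch of driving an async extraction call from a script with `asyncio.run`. `fake_extract` is a hypothetical stand-in for `extractor.extract` so the example runs without an API key; it is not part of IndoxMiner.

```python
import asyncio


async def fake_extract(text: str) -> dict:
    """Hypothetical stand-in for `extractor.extract`; not part of IndoxMiner."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    first_line = text.strip().splitlines()[0]
    return {"product_name": first_line}


async def main() -> dict:
    text = """
    MacBook Pro 16-inch with M2 chip
    Price: $2,399.99
    """
    # In real code this line would be: result = await extractor.extract(text)
    return await fake_extract(text)


# asyncio.run creates an event loop, runs the coroutine, and closes the loop
result = asyncio.run(main())
print(result)
```

In a script that uses the real extractor, only the body of `main` should need to change.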
from indoxminer import DocumentProcessor, ProcessingConfig
# Initialize processor with custom config
processor = DocumentProcessor(
    files=["invoice.pdf"],
    config=ProcessingConfig(
        hi_res_pdf=True,
        chunk_size=1000
    )
)
# Process document
documents = processor.process()
# Extract structured data
schema = ExtractorSchema(
    fields=[
        Field(
            name="bill_to",
            description="Billing address",
            field_type=FieldType.STRING
        ),
        Field(
            name="invoice_date",
            description="Invoice date",
            field_type=FieldType.DATE
        ),
        Field(
            name="total_amount",
            description="Total amount in USD",
            field_type=FieldType.FLOAT
        )
    ]
)
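# Build an extractor for the invoice schema (reuses llm_extractor from the text example)
extractor = Extractor(llm=llm_extractor, schema=schema)
results = await extractor.extract(documents)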
# Configure OCR-enabled processor
config = ProcessingConfig(
    ocr_enabled=True,
    ocr_engine="easyocr",  # or "tesseract", "paddle"
    language="en"
)
processor = DocumentProcessor(
    files=["receipt.jpg"],
    config=config
)
# Process image and extract text
documents = processor.process()
Defines the structure of data to be extracted:
- Field definitions
- Validation rules
- Output format specifications
schema = ExtractorSchema(
    fields=[...],
    output_format="json"
)
Supported field types:
- STRING: Text data
- INTEGER: Whole numbers
- FLOAT: Decimal numbers
- DATE: Date values
- BOOLEAN: True/False values
- LIST: Arrays of values
- DICT: Nested objects
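To make the field types concrete, here is a rough sketch of how a raw extracted string might be converted to a Python value for each type. The `coerce` helper, the string type names, the ISO date format, and the JSON encoding for LIST and DICT are all assumptions made for illustration; IndoxMiner's actual conversion logic may differ.

```python
import json
from datetime import date, datetime


def coerce(value: str, field_type: str):
    """Hypothetical helper: convert a raw string to the type named by field_type."""
    value = value.strip()
    if field_type == "STRING":
        return value
    if field_type == "INTEGER":
        return int(value.replace(",", ""))
    if field_type == "FLOAT":
        # Drop a leading currency sign and thousands separators, e.g. "$2,399.99"
        return float(value.replace("$", "").replace(",", ""))
    if field_type == "DATE":
        # Assumes ISO dates (YYYY-MM-DD); other formats would need their own parsing
        return datetime.strptime(value, "%Y-%m-%d").date()
    if field_type == "BOOLEAN":
        return value.lower() in {"true", "yes", "1"}
    if field_type in ("LIST", "DICT"):
        # Assumes the value arrives as a JSON-encoded string
        return json.loads(value)
    raise ValueError(f"unsupported field type: {field_type}")


print(coerce("$2,399.99", "FLOAT"))
print(coerce("Yes", "BOOLEAN"))
print(coerce("2024-03-15", "DATE"))
```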
Available validation options:
- min_length / max_length: String length constraints
- min_value / max_value: Numeric bounds
- pattern: Regex patterns
- required: Required fields
- custom: Custom validation functions
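The options above can be pictured as a set of independent checks applied to each extracted value. The sketch below is an illustration only: the `Rule` dataclass and `validate` function are hypothetical names that mirror the listed options, not IndoxMiner's `ValidationRule` implementation, and the choice of full-string regex matching is an assumption.

```python
import re
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class Rule:
    """Hypothetical rule container mirroring the options listed above."""
    min_length: Optional[int] = None
    max_length: Optional[int] = None
    min_value: Optional[float] = None
    max_value: Optional[float] = None
    pattern: Optional[str] = None
    required: bool = False
    custom: Optional[Callable[[Any], bool]] = None


def validate(value: Any, rule: Rule) -> List[str]:
    """Return a list of problems; an empty list means the value passed."""
    if value is None:
        return ["value is required"] if rule.required else []
    errors: List[str] = []
    if isinstance(value, str):
        if rule.min_length is not None and len(value) < rule.min_length:
            errors.append(f"shorter than {rule.min_length} characters")
        if rule.max_length is not None and len(value) > rule.max_length:
            errors.append(f"longer than {rule.max_length} characters")
        # Assumption: the pattern must match the whole string
        if rule.pattern is not None and not re.fullmatch(rule.pattern, value):
            errors.append(f"does not match pattern {rule.pattern}")
    # bool is a subclass of int, so exclude it from numeric checks
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        if rule.min_value is not None and value < rule.min_value:
            errors.append(f"below minimum {rule.min_value}")
        if rule.max_value is not None and value > rule.max_value:
            errors.append(f"above maximum {rule.max_value}")
    if rule.custom is not None and not rule.custom(value):
        errors.append("custom check failed")
    return errors


print(validate("A", Rule(min_length=2)))
print(validate(-1.0, Rule(min_value=0)))
```

Collecting every failure in a list, rather than raising on the first one, makes it possible to report all problems for a field at once.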
config = ProcessingConfig(
    hi_res_pdf=True,         # High-resolution PDF processing
    ocr_enabled=True,        # Enable OCR
    ocr_engine="tesseract",  # OCR engine selection
    chunk_size=1000,         # Text chunk size
    language="en",           # Processing language
    max_threads=4            # Parallel processing threads
)
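To illustrate what a `chunk_size` setting typically controls, here is a minimal sketch that splits text into chunks of at most a given number of characters without cutting words. The `chunk_text` function is hypothetical, and measuring size in characters is an assumption; the library may count tokens or split on different boundaries.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000) -> List[str]:
    """Hypothetical helper: pack whitespace-separated words into chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) > chunk_size and current:
            # Adding this word would overflow: close the chunk, start a new one
            chunks.append(current)
            current = word
        else:
            # Note: a single word longer than chunk_size becomes its own
            # oversized chunk rather than being cut
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(chunk_text("aaa bbb ccc", chunk_size=7))
```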
IndoxMiner provides detailed error reporting:
results = await extractor.extract(documents)
if not results.is_valid:
    for chunk_idx, errors in results.validation_errors.items():
        print(f"Errors in chunk {chunk_idx}:")
        for error in errors:
            print(f"- {error.field}: {error.message}")
# Access valid results
valid_data = results.get_valid_results()
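When many chunks fail validation, a per-field summary can be easier to read than the per-chunk printout. The sketch below assumes the shape used in the snippet above (a mapping from chunk index to a list of errors with `field` and `message` attributes). `FieldError` and `count_errors_by_field` are hypothetical names, and the sample data is made up for illustration.

```python
from collections import Counter
from typing import Dict, List, NamedTuple


class FieldError(NamedTuple):
    """Hypothetical stand-in for the error objects shown above."""
    field: str
    message: str


def count_errors_by_field(
    validation_errors: Dict[int, List[FieldError]]
) -> Counter:
    """Count how many validation errors each field produced across chunks."""
    counts: Counter = Counter()
    for errors in validation_errors.values():
        for error in errors:
            counts[error.field] += 1
    return counts


# Illustrative data only, not real library output
sample = {
    0: [FieldError("price", "below minimum 0")],
    2: [
        FieldError("price", "not a number"),
        FieldError("invoice_date", "unparseable date"),
    ],
}
summary = count_errors_by_field(sample)
print(summary.most_common())
```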
We welcome contributions! To contribute:
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Open a Pull Request
Please read our Contributing Guidelines for more details.
This project is licensed under the MIT License - see the LICENSE file for details.
- Documentation: Full documentation
- Issues: GitHub Issues
- Discussions: GitHub Discussions