ocr_pdf2txt

A Python library that extracts text from PDF files using OCR and adds layout visualization, audio generation, semantic topic detection, table extraction, summarization, and translation.

Features

Text Extraction: Extracts text from PDF files using Tesseract OCR.
Layout Visualization: Generates HTML files with OCR overlays to visualize recognized text regions.
Audio Output: Converts extracted text into audio files using gTTS.
Semantic Topic Detection: Identifies high-level semantic topics from the extracted text using spaCy.
Advanced Summarization: Summarizes the extracted text using Hugging Face transformers.
Translation: Translates extracted text into specified languages using googletrans.
Table Extraction: Extracts tables from PDFs into CSV files using tabula-py.
Batch Processing: Processes multiple PDFs concurrently for efficient workflows.

Installation

Prerequisites

Python 3.7+
Tesseract OCR:
- macOS: brew install tesseract
- Windows: Download from Tesseract at UB Mannheim
- Linux: Install via package manager, e.g., sudo apt-get install tesseract-ocr
Poppler: Required by pdf2image
- macOS: brew install poppler
- Windows: Download from Poppler for Windows
- Linux: Install via package manager, e.g., sudo apt-get install poppler-utils
Java: Required by tabula-py
- All platforms: Download and install Java from the Java Downloads page
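
Before installing the library, it can help to confirm that the three external tools are reachable from Python. The following is a minimal sketch using only the standard library. The executable names are assumptions: `tesseract` for Tesseract OCR, `pdftoppm` (one of the Poppler utilities that pdf2image calls), and `java` for tabula-py. The helper name `find_tools` is hypothetical and is not part of ocr_pdf2txt.

```python
import shutil
from typing import Dict, Optional

# Assumed executable names for each prerequisite listed above.
REQUIRED_TOOLS = {
    "Tesseract OCR": "tesseract",
    "Poppler": "pdftoppm",
    "Java": "java",
}


def find_tools(tools: Dict[str, str] = REQUIRED_TOOLS) -> Dict[str, Optional[str]]:
    """Map each tool label to its resolved path, or None if it is not on PATH."""
    return {label: shutil.which(exe) for label, exe in tools.items()}


if __name__ == "__main__":
    for label, location in find_tools().items():
        print(f"{label}: {location or 'NOT FOUND on PATH'}")
```

A tool reported as not found is either not installed or installed in a directory that is missing from PATH, which is a common situation on Windows.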

Install the Library

pip install ocr_pdf2txt

Usage

Single PDF Processing

from ocr_pdf2txt import ocr_pdf_to_text

pdf_path = "path/to/your/input.pdf"
output_folder = "path/to/output_folder"

ocr_pdf_to_text(
    pdf_path=pdf_path,
    output_folder=output_folder
)

Batch PDF Processing

from ocr_pdf2txt import ocr_batch_pdfs_to_text

pdf_list = [
    "path/to/your/first.pdf",
    "path/to/your/second.pdf",
    # Add more PDF paths
]
output_folder = "path/to/output_directory"

ocr_batch_pdfs_to_text(
    pdf_paths=pdf_list,
    output_folder=output_folder,
    max_workers=4
)
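
Rather than listing every file by hand, the list of paths can be built from a directory. This is a small sketch with the standard library only; `collect_pdfs` is a hypothetical helper, not a function exported by ocr_pdf2txt.

```python
from pathlib import Path
from typing import List


def collect_pdfs(folder: str, recursive: bool = False) -> List[str]:
    """Return sorted paths of all .pdf files in folder (case-insensitive suffix)."""
    root = Path(folder)
    # "**/*" also descends into subdirectories; "*" stays at the top level.
    pattern = "**/*" if recursive else "*"
    return sorted(
        str(p)
        for p in root.glob(pattern)
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

The returned list can be passed as `pdf_paths` to `ocr_batch_pdfs_to_text`.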

Extract Text Only

from ocr_pdf2txt import pdf_to_text_only

pdf_path = "path/to/your/input.pdf"
text = pdf_to_text_only(pdf_path)
print(text)

Extract Tables

from ocr_pdf2txt import extract_tables_from_pdf

pdf_path = "path/to/your/input.pdf"
output_csv = "path/to/output.csv"

extract_tables_from_pdf(pdf_path, output_csv, pages="all")

API

ocr_pdf_to_text

def ocr_pdf_to_text(
    pdf_path: str,
    output_folder: str
):

Extracts text from a single PDF file using OCR, saves the output to a text file, and then performs the following steps:

Layout Visualization: Creates HTML overlays of the OCR results.
Audio Output: Generates an MP3 file of the extracted text.
Semantic Topic Detection: Prints the detected named entity labels.
Advanced Summarization: Summarizes the extracted text.
Translation: Translates the extracted text into Spanish.

pdf_to_text_only

def pdf_to_text_only(pdf_path: str) -> str:

Extracts text from a single PDF and returns it as a string.
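
For readers unfamiliar with the underlying tools, the general technique can be sketched as follows: render each page to an image with pdf2image, run Tesseract on each image through pytesseract, and join the results. This is not the library's actual implementation. The names `join_pages` and `ocr_pdf_sketch`, and the `dpi=300` and `lang="eng"` defaults, are assumptions chosen for illustration.

```python
from typing import Iterable


def join_pages(page_texts: Iterable[str]) -> str:
    """Strip each page's text, drop empty pages, and separate pages by a blank line."""
    cleaned = (text.strip() for text in page_texts)
    return "\n\n".join(text for text in cleaned if text)


def ocr_pdf_sketch(pdf_path: str, dpi: int = 300, lang: str = "eng") -> str:
    """Illustrative OCR pipeline; requires pdf2image, pytesseract, Poppler, Tesseract."""
    # Imported lazily so the helper above works without the OCR dependencies.
    from pdf2image import convert_from_path
    import pytesseract

    # One PIL image per page; higher dpi is slower but usually more accurate.
    images = convert_from_path(pdf_path, dpi=dpi)
    return join_pages(pytesseract.image_to_string(img, lang=lang) for img in images)
```

Rendering at a higher resolution generally improves recognition of small print at the cost of memory and time, so the dpi value is worth tuning per document set.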

extract_tables_from_pdf

def extract_tables_from_pdf(pdf_path: str, output_csv_path: str, pages: str = "all"):

Extracts tables from a PDF and saves them as a CSV file.

ocr_batch_pdfs_to_text

def ocr_batch_pdfs_to_text(
    pdf_paths: List[str],
    output_folder: str,
    max_workers: int = 4
):

Processes multiple PDFs concurrently, using up to max_workers workers and performing all OCR operations on each.
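
One common way to implement this kind of fan-out is a thread pool from `concurrent.futures`. The sketch below assumes that approach; the README does not say whether the library uses threads or processes. `run_batch` and its `worker` parameter are hypothetical names. Each failure is recorded per file so that one bad PDF does not stop the rest of the batch.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List


def run_batch(
    pdf_paths: List[str],
    worker: Callable[[str], None],
    max_workers: int = 4,
) -> Dict[str, str]:
    """Run worker(path) for every path concurrently and report a status per path."""
    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the path it is processing.
        futures = {pool.submit(worker, path): path for path in pdf_paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                future.result()  # re-raises any exception from the worker
                results[path] = "ok"
            except Exception as exc:
                results[path] = f"failed: {exc}"
    return results
```

In practice, `worker` could be a small function that calls `ocr_pdf_to_text` with a fixed output folder.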

License

This project is licensed under the MIT License - see the LICENSE file for details.

Repository: VerisimilitudeX/ocr_pdf2txt