pdf-vision

I'm starting this project because I have a book in PDF format that I want converted to text, but the file appears to contain nothing more than scanned page images.

Structure

You can create a scratch/ folder to stash your own working files; the .gitignore keeps it out of version control.

Dependencies

We require:

attrs for data classes.
PyPDF2 for reading PDFs.
Pillow for handling images.
Tesseract for OCR.
OpenCV for image pre-processing.

For my application, I also installed the Serbian language pack for Tesseract:

apt install tesseract-ocr-srp
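
To confirm that Tesseract can see the language pack, the installed languages can be listed with `tesseract --list-langs`. Below is a minimal Python sketch of that check. The helper names `parse_langs` and `has_language` are made up for illustration and are not part of pdfreader.py, and `srp` is assumed to be the language code that the tesseract-ocr-srp package provides.

```python
from __future__ import annotations

import shutil
import subprocess


def parse_langs(listing: str) -> list[str]:
    """Extract language codes from the output of `tesseract --list-langs`."""
    lines = [line.strip() for line in listing.splitlines() if line.strip()]
    # The header line ("List of available languages ...:") ends with a colon.
    return [line for line in lines if not line.endswith(":")]


def has_language(code: str = "srp") -> bool:
    """Return True if the tesseract binary is on PATH and lists `code`."""
    exe = shutil.which("tesseract")
    if exe is None:
        return False
    result = subprocess.run(
        [exe, "--list-langs"], capture_output=True, text=True, check=False
    )
    # Some older releases print the listing to stderr, so look at both streams.
    return code in parse_langs(result.stdout + "\n" + result.stderr)


if __name__ == "__main__":
    print("srp available:", has_language("srp"))
```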

Getting Started

We recommend a virtual environment.

$ python3 -m venv venv
$ source venv/bin/activate
$ python3 -m pip install -r requirements.txt
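
Once the requirements are installed, a quick check that the Python packages can be found may save some debugging later. This is only a sketch: `missing_modules` is a hypothetical helper, and the import names are the usual ones for these packages (`attr` for attrs, `PIL` for Pillow, `cv2` for OpenCV). `pytesseract` is an assumption about which Tesseract wrapper requirements.txt pulls in, so adjust the list to match.

```python
from importlib.util import find_spec

# Import names for the dependencies listed above.
# "pytesseract" is an assumption; check requirements.txt for the real wrapper.
MODULES = ["attr", "PyPDF2", "PIL", "cv2", "pytesseract"]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(MODULES)
    print("missing: " + ", ".join(missing) if missing else "all dependencies found")
```

Run it inside the activated virtual environment so that it inspects the same interpreter the project uses.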

Sample Input

The book I used to test this program is a Serbian historical text, and can be downloaded from Google Drive: https://drive.google.com/file/d/1ViI2Hq5ohhPO1pM-u_i2Lv3MAUrYS-l4/view?usp=sharing

Sample Output

Input Image:

Text Output:

Међу митским бићима у која је веровао 'а мести- мично и данас верује сеоски народ источне Србије и Баната, својом занимљивошћу и архаичношћу истИ- че се женски шумски демон — шумска мајка.

The entire output text (unedited; I haven't fixed it up at all) can be downloaded as a tar archive here: https://drive.google.com/file/d/1eLnLqTay_zmiPFjzPOuG1D3fOoMJMRCi/view?usp=sharing
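
Because the output is raw, words that were split across line ends keep their hyphen, as in "мести- мично" in the sample. One possible cleanup pass, sketched below, rejoins them. This is not something pdfreader.py is known to do: `dehyphenate` is a hypothetical helper, and it assumes the break shows up in the text as a hyphen followed by a newline.

```python
import re

# A hyphen that sits between two word characters and is followed by a line
# break (with optional spaces or tabs around the break).
_LINE_END_HYPHEN = re.compile(r"(?<=\w)-[ \t]*\r?\n[ \t]*(?=\w)")


def dehyphenate(text: str) -> str:
    """Rejoin words that were hyphenated across a line break."""
    return _LINE_END_HYPHEN.sub("", text)


if __name__ == "__main__":
    print(dehyphenate("мести-\nмично и данас"))
```

Two caveats: a genuinely hyphenated compound that happens to break at a line end will lose its hyphen, and recognition mistakes inside a word (such as the stray capital in "истИ- че") still need to be fixed by hand.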
