Recognize the page content of a PDF as text using Tesseract and Ghostscript.
- Install Visual Studio 2015 Runtime (both x86 & x64)
- Install Ghostscript (x86 or x64, depending on your computer)
- Clone or download this repository.
- Open the solution in Visual Studio and run `Install-Package Tesseract -Version 3.0.2` from the Package Manager Console.
- Download the language data files for Tesseract 3.04 from the tessdata repository and add them to the `tessdata` folder of your project. Set `Copy to output directory` to `Always` for all the copied files. You can copy only the language files you are interested in (e.g. all the files that start with `eng` for English).
Setting | Variable name | Default | Description
---|---|---|---
Input PDF file | `inputPdfFile` | `test.pdf`, included in the repository | The PDF file whose selected page's content will be recognized as text.
Page number | `pageNumber` | `1` | The number of the page whose content will be recognized as text.
Recognition language | `ocrLanguage` | `"eng"` | The language Tesseract uses to recognize text. When you change this value, make sure you add the matching language data files to the `tessdata` folder. See the Installation section.
DPI for converting the PDF page to an image | `pdfToImageDPI` | `150` | Tesseract cannot recognize text directly from PDF pages, which is why the PDF page is first converted to an image. This property sets the DPI used for that conversion.
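
The following is a minimal sketch of how the four settings above could fit together, not necessarily the way this repository implements it. It assumes that the Ghostscript console executable (`gswin64c.exe`, or `gswin32c.exe` for x86) is on your `PATH` and that the `tessdata` folder is copied next to the built executable. The class name `PdfOcrSketch`, its method names, and the `ghostscriptExe` parameter are hypothetical and used only for illustration.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using Tesseract; // NuGet package installed during the installation steps

public static class PdfOcrSketch
{
    // Builds Ghostscript arguments that render exactly one page to a PNG file.
    public static string BuildGhostscriptArgs(
        string inputPdfFile, int pageNumber, int pdfToImageDPI, string outputImage)
    {
        return string.Format(
            "-q -dNOPAUSE -dBATCH -sDEVICE=png16m -r{0} " +
            "-dFirstPage={1} -dLastPage={1} -sOutputFile=\"{2}\" \"{3}\"",
            pdfToImageDPI, pageNumber, outputImage, inputPdfFile);
    }

    // Converts the selected page to an image, then lets Tesseract read it.
    public static string RecognizePage(
        string inputPdfFile, int pageNumber, string ocrLanguage, int pdfToImageDPI,
        string ghostscriptExe = "gswin64c.exe") // assumption: found on PATH
    {
        string imagePath = Path.Combine(Path.GetTempPath(), Guid.NewGuid() + ".png");
        try
        {
            var startInfo = new ProcessStartInfo
            {
                FileName = ghostscriptExe,
                Arguments = BuildGhostscriptArgs(
                    inputPdfFile, pageNumber, pdfToImageDPI, imagePath),
                UseShellExecute = false,
                CreateNoWindow = true
            };

            using (var ghostscript = Process.Start(startInfo))
            {
                ghostscript.WaitForExit();
                if (ghostscript.ExitCode != 0)
                    throw new InvalidOperationException(
                        "Ghostscript exited with code " + ghostscript.ExitCode);
            }

            // "./tessdata" must contain the data files for ocrLanguage.
            using (var engine = new TesseractEngine("./tessdata", ocrLanguage, EngineMode.Default))
            using (var image = Pix.LoadFromFile(imagePath))
            using (var page = engine.Process(image))
            {
                return page.GetText();
            }
        }
        finally
        {
            // Remove the temporary page image whether or not OCR succeeded.
            if (File.Exists(imagePath))
                File.Delete(imagePath);
        }
    }
}
```

With the default values from the table, a call would look like `PdfOcrSketch.RecognizePage("test.pdf", 1, "eng", 150)`. Raising `pdfToImageDPI` generally gives Tesseract a sharper image to work with, at the cost of a larger image and slower processing.
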
If you need more information on Tesseract usage, please visit its own repository.