Commit 06ae88e

Merge pull request #95 from kritikaparmar-programmer/scrape
Added script to scrape PDF #29
2 parents d46a66b + c52a783 commit 06ae88e

3 files changed: +60 -0 lines changed

ScrapePDF/Readme.md

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
# Script to scrape PDF

## Overview:
- A beginner-friendly script to scrape a PDF. You can easily get document info such as creator, creation_date and the number of pages, and extract text from as many pages as you want.


### Installing required libraries

`` pip install PyPDF2 ``

## How to use this script?

- Navigate to the ScrapePDF folder in Command Prompt and run the following command:

`` python pdfscrapper.py ``

- After this, enter the path of the PDF file.
- Ex: C:/Users/Admin/Desktop/sample.pdf
- You will receive information about the PDF, such as who created it and when it was created.
- After that, you will receive the scraped text from the PDF.

### Example Output

<p align = "center">
<img src="ex.png" alt="Output">
</p>
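The README lists creation_date among the document info. PDF files usually store dates as strings such as `D:20200115093000+05'30'` rather than as readable timestamps. The following is a minimal sketch of converting such a string into a `datetime`, assuming that format. `parse_pdf_date` is a hypothetical helper introduced here for illustration; it is not part of the script or of PyPDF2.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches PDF date strings such as "D:20200115093000+05'30'".
# Every field after the year is optional in this format.
_PDF_DATE = re.compile(
    r"^(?:D:)?(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?P<sign>[+\-Z])?(?P<tz_hour>\d{2})?'?(?P<tz_minute>\d{2})?'?$"
)


def parse_pdf_date(raw):
    """Convert a PDF date string to a datetime (hypothetical helper).

    Returns None when the string is not a valid PDF date.
    """
    match = _PDF_DATE.match(raw.strip())
    if match is None:
        return None
    parts = match.groupdict()
    try:
        tzinfo = None
        if parts["sign"] == "Z":
            tzinfo = timezone.utc
        elif parts["sign"] in ("+", "-"):
            offset = timedelta(
                hours=int(parts["tz_hour"] or 0),
                minutes=int(parts["tz_minute"] or 0),
            )
            tzinfo = timezone(offset if parts["sign"] == "+" else -offset)
        # Missing month/day default to 1, missing time fields to 0.
        return datetime(
            int(parts["year"]),
            int(parts["month"] or 1),
            int(parts["day"] or 1),
            int(parts["hour"] or 0),
            int(parts["minute"] or 0),
            int(parts["second"] or 0),
            tzinfo=tzinfo,
        )
    except ValueError:
        # Out-of-range field, for example month 13.
        return None
```

A value read from the document info could be passed to `parse_pdf_date` before printing; strings that do not match the assumed format come back as `None`, so the raw value can still be shown as a fallback.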

ScrapePDF/ex.png

56.2 KB

ScrapePDF/pdfscrapper.py

Lines changed: 33 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,33 @@
# import PyPDF2 library
import PyPDF2 as p2

pdffile = input("Enter path to pdf file you want to scrape: \n")
PDFfile = open(pdffile, "rb")
pdfread = p2.PdfFileReader(PDFfile)


# to check whether the pdf is encrypted or not
print(pdfread.getIsEncrypted())


# to get information about the document like creator, creation_date
print(pdfread.getDocumentInfo())


# to get number of pages
print(pdfread.getNumPages())


# To extract text from a single page of pdf
a = int(input("Enter the page no. from which you want to extract text: \n"))
x = pdfread.getPage(a)
print(x.extractText())


# Extract entire pdf
print("---------ENTIRE PDF----------")
i = 0
while i < pdfread.getNumPages():
    pageinfo = pdfread.getPage(i)
    print(pageinfo.extractText())
    i += 1
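
The script passes the number typed by the user straight to `getPage`, which counts pages from 0 (the extraction loop above starts at `i = 0` for the same reason). Entering the last page number as a reader would see it therefore fails. The following is a minimal sketch of one way to handle this, assuming the user thinks in 1-based page numbers. `to_page_index` and `iter_page_text` are hypothetical helpers, not part of the commit, and rely only on the reader methods the script already calls.

```python
def to_page_index(page_number, num_pages):
    """Convert a 1-based page number to a 0-based index (hypothetical helper).

    Raises ValueError if the page number is outside 1..num_pages.
    """
    if not 1 <= page_number <= num_pages:
        raise ValueError(
            f"page must be between 1 and {num_pages}, got {page_number}"
        )
    return page_number - 1


def iter_page_text(reader):
    """Yield (page_number, text) pairs, numbering pages from 1.

    `reader` is assumed to expose getNumPages() and getPage(index),
    and each page extractText(), as used in pdfscrapper.py.
    """
    for index in range(reader.getNumPages()):
        yield index + 1, reader.getPage(index).extractText()
```

In the script this could look like `pdfread.getPage(to_page_index(a, pdfread.getNumPages()))`. Newer PyPDF2 and pypdf releases renamed these methods (for example `len(reader.pages)` and `page.extract_text()`), so the names inside `iter_page_text` would need adjusting there.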
