Commit d45c718

Added script to scrape PDF #29
1 parent ce9734c commit d45c718

2 files changed: 46 additions, 0 deletions

ScrapePDF/Readme.md

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
# Script to scrape PDF

## Overview:
- A beginner-friendly script to scrape a PDF. You can easily get document info such as creator, creation_date, and the number of pages, and extract text from as many pages as you want.

### Installing required libraries

`` pip install PyPDF2 ``

## How to use this script?

- Navigate to the ScrapePDF folder in Command Prompt and type the following command:

`` python pdfscrapper.py ``
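
The script in this commit reads a hard-coded placeholder path ("File path here.pdf"), so it has to be edited before each run. One possible alternative, shown here only as a sketch, is to take the path from the command line with the standard `argparse` module. The function name `parse_args` and the arguments `pdf_path` and `--page` are hypothetical and are not part of the commit.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line arguments; all argument names here are hypothetical."""
    parser = argparse.ArgumentParser(
        description="Scrape text and metadata from a PDF."
    )
    # Positional argument: the PDF to read instead of a hard-coded path.
    parser.add_argument("pdf_path", help="path to the PDF file to read")
    # Optional argument: a single page to extract; None means all pages.
    parser.add_argument(
        "--page",
        type=int,
        default=None,
        help="zero-based index of a single page to extract (default: all pages)",
    )
    return parser.parse_args(argv)


# Passing an explicit list stands in for a real command line such as:
#   python pdfscrapper.py report.pdf --page 2
args = parse_args(["report.pdf", "--page", "2"])
print(args.pdf_path, args.page)  # report.pdf 2
```

With something like this in place, `args.pdf_path` could be passed to `open(...)` where the placeholder string currently sits.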

ScrapePDF/pdfscrapper.py

Lines changed: 31 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,31 @@
# Import the PyPDF2 library
import PyPDF2 as p2

# Replace the placeholder with the path to your PDF
PDFfile = open("File path here.pdf", "rb")
# PdfReader and the snake_case names below are the PyPDF2 3.x API;
# the older PdfFileReader/getPage/extractText names were removed in 3.0
pdfread = p2.PdfReader(PDFfile)

# Check whether the PDF is encrypted
print(pdfread.is_encrypted)

# Get document information such as creator and creation date
print(pdfread.metadata)

# Get the number of pages
print(len(pdfread.pages))

# Extract text from a single page of the PDF (page indices start at 0)
a = int(input("Enter the page index (starting from 0) from which you want to extract text: \n"))
x = pdfread.pages[a]
print(x.extract_text())

# Extract the text of the entire PDF
for pageinfo in pdfread.pages:
    print(pageinfo.extract_text())

# Close the file once all pages have been read
PDFfile.close()
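
The script accepts one page index and then prints every page, while the README promises extracting "as many pages as you want". A minimal sketch of how a user-supplied selection such as `1-3,5` might be validated and converted to zero-based page indices follows. The helper `parse_page_selection` and its 1-based input convention are assumptions for illustration, not part of the commit, and the code uses only the standard library.

```python
def parse_page_selection(selection, num_pages):
    """Turn a 1-based selection such as "1-3,5" into sorted zero-based indices.

    Hypothetical helper, not part of the committed script.
    """
    indices = set()
    for part in selection.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end > num_pages or start > end:
            raise ValueError(f"page range {part!r} is outside 1..{num_pages}")
        # Convert from 1-based page numbers to 0-based indices.
        indices.update(range(start - 1, end))
    return sorted(indices)


print(parse_page_selection("1-3,5", num_pages=6))  # [0, 1, 2, 4]
```

The returned indices could then drive the extraction loop in place of the counter that runs over every page, with `num_pages` taken from the reader's page count.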
