Merge pull request #95 from kritikaparmar-programmer/scrape

tejan-singh · web-flow · commit 06ae88e0f44b · 2020-10-22T14:52:51.000+05:30
Added script to scrape PDF #29
diff --git a/ScrapePDF/Readme.md b/ScrapePDF/Readme.md
@@ -0,0 +1,27 @@
+# Script to scrape pdf
+
+## Overview:
+- A beginner-friendly script to scrape a PDF. You can easily get document info such as creator, creation_date, and number of pages, and extract text from as many pages as you want.
+
+
+### Installing required libraries
+
+`` pip install PyPDF2 ``
+
+## How to use this script?
+
+- Navigate to the ScrapePDF folder in Command Prompt and run the following command:
+
+`` python pdfscrapper.py ``
+
+- After this you have to enter the path of the pdf file.
+- Ex: C:/Users/Admin/Desktop/sample.pdf
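+- The script prints whether the PDF is encrypted, its document info (such as who created it and when), and the number of pages.
+- You are then asked for a page number; the script prints the text of that page, followed by the scraped text of the entire PDF.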
+
+### Example Output
+
+<p align = "center">
+	<img src="ex.png" alt="Output">
+</p>
+
diff --git a/ScrapePDF/ex.png b/ScrapePDF/ex.png
diff --git a/ScrapePDF/pdfscrapper.py b/ScrapePDF/pdfscrapper.py
@@ -0,0 +1,33 @@
+# import PyPDF2 library
+import PyPDF2 as p2
+
+pdffile = input("Enter path to pdf file you want to scrape: \n")
+PDFfile = open(pdffile, "rb")
+pdfread = p2.PdfFileReader(PDFfile)
+
+
+# to check whether the pdf is encrypted or not
+print(pdfread.getIsEncrypted())
+ 
+
+# to get information about the document like creator, creation_date
+print(pdfread.getDocumentInfo())
+
+
+# to get number of pages
+print(pdfread.getNumPages())
+
+
+# To extract text from a single page of the pdf (getPage() is 0-indexed)
+a = int(input("Enter the page no. (starting from 1) from which you want to extract text: \n"))
+x = pdfread.getPage(a - 1)
+print(x.extractText())
+
+
+# Extract entire pdf
+print("---------ENTIRE PDF----------")
+for i in range(pdfread.getNumPages()):
+    pageinfo = pdfread.getPage(i)
+    print(pageinfo.extractText())
+
+PDFfile.close()
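
Review note: the page lookup and the full-document loop in `pdfscrapper.py` could be factored into two small helpers, sketched below. This is only a rough illustration, not part of the PR: the names `to_page_index` and `extract_all_text` are hypothetical, and the sketch assumes a reader object exposing `getNumPages()` and `getPage(i)` as the legacy PyPDF2 `PdfFileReader` in the script does.

```python
def to_page_index(page_number, num_pages):
    """Convert a 1-based page number into the 0-based index getPage() expects.

    Hypothetical helper: raises ValueError when the number is out of range
    instead of letting the reader fail with an IndexError.
    """
    if not 1 <= page_number <= num_pages:
        raise ValueError(
            f"page must be between 1 and {num_pages}, got {page_number}"
        )
    return page_number - 1


def extract_all_text(reader):
    """Return the text of every page, in order, as a list of strings.

    `reader` is assumed to be any object with getNumPages() and getPage(i),
    where each page offers extractText(), as in the script above.
    """
    return [
        reader.getPage(i).extractText()
        for i in range(reader.getNumPages())
    ]
```

With the script's API this might be called as `extract_all_text(p2.PdfFileReader(f))` inside a `with open(path, "rb") as f:` block, which would also close the file once the text has been read.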