Commit 80c7d7e

Merge pull request #272 from Carlonii/pdftotext
PDF To Text
2 parents d8f40bf + f2565cb commit 80c7d7e

4 files changed: +106 -0 lines changed

PDF to text/Atividade 28 Fev.pdf

28 KB
Binary file not shown.

PDF to text/README.md

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
# PDF to Text Converter

This project is a Python tool that converts PDF files into clean, readable text. It extracts text from both local and remote PDFs, post-processes it to improve readability, and saves the formatted content to `.txt` files. It can also download PDFs from URLs and clean the extracted text to fix stray line breaks and irregular spacing.

---

## Features

1. **Text Extraction from Local and Remote PDFs**:
   - Supports PDF files stored locally and PDFs available via URL.
2. **Text Cleaning and Formatting**:
   - Removes unwanted line breaks and excessive spacing.
   - Preserves paragraphs and maintains the original structure.
3. **Saving Extracted Text as `.txt` Files**:
   - The extracted text can be saved as a `.txt` file with the same name as the original PDF.
4. **Automatic Output Folder Creation**:
   - Organizes generated text files into an `output_texts` folder for easy navigation and future use.
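
As a rough illustration of the cleaning behaviour listed above, the sketch below joins lines that were broken mid-paragraph while keeping blank-line paragraph breaks. The function name `tidy_text` and the sample string are made up for this example, and the regular expressions show one possible approach rather than the script's exact rules.

```python
import re

def tidy_text(text):
    """Join hard-wrapped lines but keep blank-line paragraph breaks (illustrative helper)."""
    # A lone newline (no newline on either side) is treated as a wrap inside a paragraph.
    text = re.sub(r'(?<!\n)\n(?!\n)', ' ', text)
    # Collapse runs of spaces and tabs, leaving newlines alone.
    text = re.sub(r'[ \t]+', ' ', text)
    # Normalize three or more newlines to a single paragraph break.
    text = re.sub(r'\n{3,}', '\n\n', text)
    return text.strip()

# Invented sample: one wrapped paragraph, extra blank lines, then a second paragraph.
sample = "First line of a\nwrapped   paragraph.\n\n\nSecond paragraph\nstarts here."
print(tidy_text(sample))
```

Restricting the whitespace collapse to spaces and tabs is the important design choice here: collapsing every whitespace character would also swallow the paragraph breaks.
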

## Requirements

Make sure the following libraries are installed:

- `requests`
- `PyPDF2`

If you do not have them yet, install them with:

```bash
pip install requests PyPDF2
```

PDF to text/script.py

Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
import os
import re
import requests
import PyPDF2

def download_pdf(url, local_filename):
    """Download PDF from a URL to a local file."""
    # Time out on stalled connections and raise on HTTP errors so an
    # error page is never written to disk as if it were a PDF.
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    with open(local_filename, 'wb') as f:
        f.write(response.content)

def extract_text_from_pdf(pdf_path):
    """Extract text from a single PDF file."""
    try:
        with open(pdf_path, 'rb') as file:
            reader = PyPDF2.PdfReader(file)
            text = ""
            for page in reader.pages:
                # Separate pages with a newline so words at page edges do not fuse
                text += (page.extract_text() or "") + "\n"
            # Apply text cleaning after extraction
            return clean_extracted_text(text)
    except Exception as e:
        print(f"Failed to read {pdf_path}: {e}")
        return None

def clean_extracted_text(text):
    """Clean and format the extracted text."""
    # Remove line breaks in the middle of sentences: replace a single newline
    # with a space unless it follows a period or is part of a blank line
    cleaned_text = re.sub(r'(?<![.\n])\n(?!\n)', ' ', text)
    # Remove multiple spaces (spaces and tabs only, so newlines survive)
    cleaned_text = re.sub(r'[ \t]+', ' ', cleaned_text)
    # Preserve paragraphs by keeping double newlines
    cleaned_text = re.sub(r'\n{2,}', '\n\n', cleaned_text)
    return cleaned_text.strip()

def convert_pdf_to_txt(pdf_path, save_to_file=True, output_folder="output_texts"):
    """Convert a single PDF to text, optionally saving to a file."""
    try:
        # Check if the path is a URL or local file
        if pdf_path.startswith(("http://", "https://")):
            # The download is staged inside the output folder, so create it first
            os.makedirs(output_folder, exist_ok=True)
            # Download PDF to a temporary location
            local_pdf = os.path.join(output_folder, pdf_path.split('/')[-1])
            download_pdf(pdf_path, local_pdf)
            text = extract_text_from_pdf(local_pdf)
            os.remove(local_pdf)  # Remove the temporary file
        else:
            # Handle local file
            text = extract_text_from_pdf(pdf_path)

        if text:
            # Print the cleaned text
            print(f"\nExtracted text:\n{text}\n")

            if save_to_file:
                # Save the extracted text to a .txt file
                os.makedirs(output_folder, exist_ok=True)
                base_name = os.path.splitext(os.path.basename(pdf_path))[0]
                output_file = os.path.join(output_folder, f"{base_name}.txt")
                with open(output_file, 'w', encoding='utf-8') as txt_file:
                    txt_file.write(text)
                print(f"Text successfully saved to: {output_file}")
        else:
            print(f"Could not extract text from: {pdf_path}")
    except Exception as e:
        print(f"Error processing {pdf_path}: {e}")

# Example usage:
if __name__ == "__main__":
    # Example PDF from the internet
    # pdf = "https://fase.org.br/wp-content/uploads/2014/05/exemplo-de-pdf.pdf"

    # Example local PDF
    pdf = "D:/repos/Python-Scripts/PDF to text/Atividade 28 Fev.pdf"

    # Convert PDF to text and save the cleaned text to a file
    convert_pdf_to_txt(pdf)
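
One detail worth checking in `convert_pdf_to_txt`: the output name is taken from the last segment of `pdf_path`, so a URL with a query string or percent-encoded characters would carry them into the file name. The sketch below shows one possible stricter derivation using only the standard library. `output_name_for` is a hypothetical helper, not part of this commit, the `example.com` URL is invented, and the `"document"` fallback is an assumption.

```python
import os
from urllib.parse import urlparse, unquote

def output_name_for(pdf_path, output_folder="output_texts"):
    """Return the .txt path for a local PDF path or an http(s) URL (hypothetical helper)."""
    if pdf_path.startswith(("http://", "https://")):
        # Use only the URL path component so query strings and fragments are dropped,
        # then decode percent-escapes such as %20.
        name = os.path.basename(unquote(urlparse(pdf_path).path))
    else:
        name = os.path.basename(pdf_path)
    # Fall back to a generic name if the URL ends in "/" (assumed default).
    base_name = os.path.splitext(name)[0] or "document"
    return os.path.join(output_folder, f"{base_name}.txt")

print(output_name_for("https://example.com/files/report%202024.pdf?download=1"))
print(output_name_for("Atividade 28 Fev.pdf"))
```
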

README.md

Lines changed: 1 addition & 0 deletions
@@ -88,6 +88,7 @@ More information on contributing and the general code of conduct for discussion
 | Password Generator | [Password Generator](https://github.com/DhanushNehru/Python-Scripts/tree/master/Password%20Generator) | Generates a random password. |
 | Password Manager | [Password Manager](https://github.com/nem5345/Python-Scripts/tree/master/Password%20Manager) | Generate and interact with a password manager. |
 | PDF to Audio | [PDF to Audio](https://github.com/DhanushNehru/Python-Scripts/tree/master/PDF%20to%20Audio) | Converts PDF to audio. |
+| PDF to Text | [PDF to text](https://github.com/DhanushNehru/Python-Scripts/tree/master/PDF%20to%20text) | Converts PDF to text. |
 | Planet Simulation | [Planet Simulation](https://github.com/DhanushNehru/Python-Scripts/tree/master/Planet%20Simulation) | A simulation of several planets rotating around the sun.
 | Playlist Exchange | [Playlist Exchange](https://github.com/DhanushNehru/Python-Scripts/tree/master/Playlist%20Exchange) | A Python script to exchange songs and playlists between Spotify and Python.
 | PNG TO JPG CONVERTOR | [PNG-To-JPG](https://github.com/DhanushNehru/Python-Scripts/tree/master/PNG%20To%20JPG) | A PNG TO JPG IMAGE CONVERTOR.
