PDF text extraction with Unstructured #19

Merged
merged 4 commits into from
Apr 2, 2024
18 changes: 18 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -73,12 +73,16 @@ and in your python environment:

```
import datafog
from datafog import PresidioEngine as presidio

# Note: this rebinds the name `datafog` from the module to a DataFog instance.
datafog = datafog.DataFog()

```

## Examples

Here are some examples of datafog being used to redact information in business contexts. Please see '/examples' for our [Getting Started](examples/getting-started.ipynb) notebook. We'll be regularly updating content and providing comprehensive guides to using DataFog in production contexts. If you have any ideas for a tutorial or guide that you would like to see, please let us know!

### Scanning a single string

```
ceo_email_chunk = "I'm announcing on Friday that Jeff is going to be CTO."

Expand All @@ -93,6 +97,20 @@ Here are some examples of datafog being used to redact information in business c

```

### Scanning a list of PDFs

```
from datafog import DataFog

file_dir = [
    "/Users/sidmohan/Desktop/datafog-v2.4.0/datafog-python/tests/files/input_files/agi-builder-meetup.pdf",
    "/Users/sidmohan/Desktop/datafog-v2.4.0/datafog-python/tests/files/input_files/pypdf-readthedocs-io-en-stable.pdf",
]
# upload_files is a static method, so no DataFog instance is needed.
result = DataFog.upload_files(uploaded_files=file_dir)
print(result)
```

The output is a dictionary whose keys are the file names and whose values are the text extracted from each file. For example:
`{'agi-builder-meetup.pdf': "2/26/24, 2:16 PM\nAGI Builders Meetup SF · Luma\nContact the HostReport Event29\nEvent FullIf youʼd like"}`
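
Because the result is a plain dictionary, it can be inspected with ordinary Python. The sketch below assumes a result shaped like the example above (file name mapped to extracted text). The `summarize` helper and the sample values are hypothetical and are not part of DataFog.

```python
from typing import Dict


def summarize(result: Dict[str, str], preview_chars: int = 40) -> Dict[str, str]:
    """Return a one-line preview of the extracted text for each file."""
    previews = {}
    for file_name, text in result.items():
        # Collapse newlines and repeated whitespace so the preview fits on one line.
        one_line = " ".join(text.split())
        previews[file_name] = one_line[:preview_chars]
    return previews


# Hypothetical result, shaped like the dictionary returned by upload_files.
result = {"agi-builder-meetup.pdf": "2/26/24, 2:16 PM\nAGI Builders Meetup SF · Luma\n"}
for file_name, preview in summarize(result).items():
    print(f"{file_name}: {preview}")
```

A preview like this is a quick way to confirm that each PDF produced text before passing it on for scanning.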

## Contributing

DataFog is a community-driven **open-source** platform, and we've been fortunate to have a small and growing contributor base. We'd love to hear your ideas, feedback, and suggestions for improvement: anything you think would make DataFog better! Join our [Discord](https://discord.gg/bzDth394R4) to become part of our growing community.
Expand Down
180 changes: 59 additions & 121 deletions examples/uploading-file-types.ipynb
Expand Up @@ -26,87 +26,16 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Collecting datafog==2.3.2b10\n",
" Downloading datafog-2.3.2b10.tar.gz (13 kB)\n",
" Installing build dependencies ... \u001b[?25ldone\n",
"\u001b[?25h Getting requirements to build wheel ... \u001b[?25ldone\n",
"\u001b[?25h Installing backend dependencies ... \u001b[?25ldone\n",
"\u001b[?25h Preparing metadata (pyproject.toml) ... \u001b[?25ldone\n",
"\u001b[?25hRequirement already satisfied: pandas==2.2.1 in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from datafog==2.3.2b10) (2.2.1)\n",
"Requirement already satisfied: presidio-analyzer==2.2.353 in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from datafog==2.3.2b10) (2.2.353)\n",
"Requirement already satisfied: pytest==8.0.2 in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from datafog==2.3.2b10) (8.0.2)\n",
"Requirement already satisfied: Requests==2.31.0 in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from datafog==2.3.2b10) (2.31.0)\n",
"Requirement already satisfied: spacy==3.4.4 in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from datafog==2.3.2b10) (3.4.4)\n",
"Requirement already satisfied: en-spacy-pii-fast in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from datafog==2.3.2b10) (0.0.0)\n",
"Requirement already satisfied: numpy<2,>=1.23.2 in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from pandas==2.2.1->datafog==2.3.2b10) (1.26.4)\n",
"Requirement already satisfied: python-dateutil>=2.8.2 in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from pandas==2.2.1->datafog==2.3.2b10) (2.9.0.post0)\n",
"Requirement already satisfied: pytz>=2020.1 in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from pandas==2.2.1->datafog==2.3.2b10) (2024.1)\n",
"Requirement already satisfied: tzdata>=2022.7 in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from pandas==2.2.1->datafog==2.3.2b10) (2024.1)\n",
"Requirement already satisfied: regex in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from presidio-analyzer==2.2.353->datafog==2.3.2b10) (2023.12.25)\n",
"Requirement already satisfied: tldextract in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from presidio-analyzer==2.2.353->datafog==2.3.2b10) (5.1.2)\n",
"Requirement already satisfied: pyyaml in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from presidio-analyzer==2.2.353->datafog==2.3.2b10) (6.0.1)\n",
"Requirement already satisfied: phonenumbers<9.0.0,>=8.12 in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from presidio-analyzer==2.2.353->datafog==2.3.2b10) (8.13.32)\n",
"Requirement already satisfied: iniconfig in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from pytest==8.0.2->datafog==2.3.2b10) (2.0.0)\n",
"Requirement already satisfied: packaging in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from pytest==8.0.2->datafog==2.3.2b10) (24.0)\n",
"Requirement already satisfied: pluggy<2.0,>=1.3.0 in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from pytest==8.0.2->datafog==2.3.2b10) (1.4.0)\n",
"Requirement already satisfied: charset-normalizer<4,>=2 in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from Requests==2.31.0->datafog==2.3.2b10) (2.1.1)\n",
"Requirement already satisfied: idna<4,>=2.5 in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from Requests==2.31.0->datafog==2.3.2b10) (3.6)\n",
"Requirement already satisfied: urllib3<3,>=1.21.1 in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from Requests==2.31.0->datafog==2.3.2b10) (2.2.1)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from Requests==2.31.0->datafog==2.3.2b10) (2024.2.2)\n",
"Requirement already satisfied: spacy-legacy<3.1.0,>=3.0.10 in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from spacy==3.4.4->datafog==2.3.2b10) (3.0.12)\n",
"Requirement already satisfied: spacy-loggers<2.0.0,>=1.0.0 in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from spacy==3.4.4->datafog==2.3.2b10) (1.0.5)\n",
"Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from spacy==3.4.4->datafog==2.3.2b10) (1.0.10)\n",
"Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from spacy==3.4.4->datafog==2.3.2b10) (2.0.8)\n",
"Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from spacy==3.4.4->datafog==2.3.2b10) (3.0.9)\n",
"Requirement already satisfied: thinc<8.2.0,>=8.1.0 in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from spacy==3.4.4->datafog==2.3.2b10) (8.1.12)\n",
"Requirement already satisfied: wasabi<1.1.0,>=0.9.1 in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from spacy==3.4.4->datafog==2.3.2b10) (0.10.1)\n",
"Requirement already satisfied: srsly<3.0.0,>=2.4.3 in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from spacy==3.4.4->datafog==2.3.2b10) (2.4.8)\n",
"Requirement already satisfied: catalogue<2.1.0,>=2.0.6 in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from spacy==3.4.4->datafog==2.3.2b10) (2.0.10)\n",
"Requirement already satisfied: typer<0.8.0,>=0.3.0 in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from spacy==3.4.4->datafog==2.3.2b10) (0.7.0)\n",
"Requirement already satisfied: pathy>=0.3.5 in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from spacy==3.4.4->datafog==2.3.2b10) (0.11.0)\n",
"Requirement already satisfied: smart-open<7.0.0,>=5.2.1 in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from spacy==3.4.4->datafog==2.3.2b10) (6.4.0)\n",
"Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from spacy==3.4.4->datafog==2.3.2b10) (4.66.2)\n",
"Requirement already satisfied: pydantic!=1.8,!=1.8.1,<1.11.0,>=1.7.4 in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from spacy==3.4.4->datafog==2.3.2b10) (1.10.14)\n",
"Requirement already satisfied: jinja2 in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from spacy==3.4.4->datafog==2.3.2b10) (3.1.3)\n",
"Requirement already satisfied: setuptools in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from spacy==3.4.4->datafog==2.3.2b10) (65.5.0)\n",
"Requirement already satisfied: langcodes<4.0.0,>=3.2.0 in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from spacy==3.4.4->datafog==2.3.2b10) (3.3.0)\n",
"Requirement already satisfied: pathlib-abc==0.1.1 in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from pathy>=0.3.5->spacy==3.4.4->datafog==2.3.2b10) (0.1.1)\n",
"Requirement already satisfied: typing-extensions>=4.2.0 in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from pydantic!=1.8,!=1.8.1,<1.11.0,>=1.7.4->spacy==3.4.4->datafog==2.3.2b10) (4.10.0)\n",
"Requirement already satisfied: six>=1.5 in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from python-dateutil>=2.8.2->pandas==2.2.1->datafog==2.3.2b10) (1.16.0)\n",
"Requirement already satisfied: blis<0.8.0,>=0.7.8 in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from thinc<8.2.0,>=8.1.0->spacy==3.4.4->datafog==2.3.2b10) (0.7.11)\n",
"Requirement already satisfied: confection<1.0.0,>=0.0.1 in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from thinc<8.2.0,>=8.1.0->spacy==3.4.4->datafog==2.3.2b10) (0.1.4)\n",
"Requirement already satisfied: click<9.0.0,>=7.1.1 in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from typer<0.8.0,>=0.3.0->spacy==3.4.4->datafog==2.3.2b10) (8.1.7)\n",
"Requirement already satisfied: MarkupSafe>=2.0 in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from jinja2->spacy==3.4.4->datafog==2.3.2b10) (2.1.5)\n",
"Requirement already satisfied: requests-file>=1.4 in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from tldextract->presidio-analyzer==2.2.353->datafog==2.3.2b10) (2.0.0)\n",
"Requirement already satisfied: filelock>=3.0.8 in /Users/sidmohan/Desktop/datafog-pypi-v2.3.2/.venv/lib/python3.11/site-packages (from tldextract->presidio-analyzer==2.2.353->datafog==2.3.2b10) (3.13.1)\n",
"Building wheels for collected packages: datafog\n",
" Building wheel for datafog (pyproject.toml) ... \u001b[?25ldone\n",
"\u001b[?25h Created wheel for datafog: filename=datafog-2.3.2b10-py3-none-any.whl size=10839 sha256=98c6651a54b1e3b5d878d59fa534c7c8c22e1e6d4a49f04b43d4e447b9bd7e90\n",
" Stored in directory: /Users/sidmohan/Library/Caches/pip/wheels/a2/87/a5/513ca3a2ad3d826f945f1277a85346ae1bfd4d6261bb202b2d\n",
"Successfully built datafog\n",
"Installing collected packages: datafog\n",
" Attempting uninstall: datafog\n",
" Found existing installation: datafog 2.3.2b9\n",
" Uninstalling datafog-2.3.2b9:\n",
" Successfully uninstalled datafog-2.3.2b9\n",
"Successfully installed datafog-2.3.2b10\n",
"Note: you may need to restart the kernel to use updated packages.\n"
]
}
],
"outputs": [],
"source": [
"# Initialize\n",
"%pip install datafog==2.3.2b10\n",
"%pip install datafog==2.4.0b4\n",
"import json\n",
"\n",
"import requests\n",
"import datafog\n",
"from datafog import PresidioEngine as presidio\n",
"import pandas as pd"
]
Expand All @@ -120,36 +49,9 @@
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" uuid \\\n",
"0 a1b2c3d4-e5f6-7g8h-9i0j-k1l2m3n4o5p6 \n",
"1 q9w8e7r6-t5y4-u3i2-o1p0-a9s8d7f6g5h4 \n",
"2 z1x2c3v4-b5n6-m7q8-w9e0-r1t2y3u4i5o6 \n",
"3 p1o2i3u4-y5t6-r7e8-w9q0-a1s2d3f4g5h6 \n",
"4 l1k2j3h4-g5f6-d7s8-a9q0-w1e2r3t4y5u6 \n",
"\n",
" text_chunk \\\n",
"0 Cisco to Acquire Splunk, to Help Make Organiza... \n",
"1 Cisco intends to acquire Splunk for $157 per s... \n",
"2 Our combined capabilities will drive the next ... \n",
"3 Tidal Partners LLC is acting as financial advi... \n",
"4 Cisco will host a conference call for Thursday... \n",
"\n",
" doc_source \n",
"0 CEO_Google_Drive_Press_Release_Draft.docx \n",
"1 CEO_Google_Drive_Press_Release_Draft.docx \n",
"2 CEO_Google_Drive_Press_Release_Draft.docx \n",
"3 CEO_Google_Drive_Press_Release_Draft.docx \n",
"4 CEO_Google_Drive_Press_Release_Draft.docx \n"
]
}
],
"outputs": [],
"source": [
"# Load the JSON data from the URL\n",
"url = \"https://gist.githubusercontent.com/sidmohan0/757185e0b9ff63fe00096baa0ce3fa45/raw/cb30da88e985d171bef281c927434cac52c239ea/sample.json\"\n",
Expand All @@ -165,7 +67,7 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
Expand Down Expand Up @@ -200,25 +102,61 @@
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 [type: CUSTOM_PII, start: 0, end: 5, score: 1....\n",
"1 [type: CUSTOM_PII, start: 0, end: 5, score: 1....\n",
"2 [type: CUSTOM_PII, start: 41, end: 56, score: ...\n",
"3 [type: CUSTOM_PII, start: 0, end: 18, score: 1...\n",
"4 [type: CUSTOM_PII, start: 0, end: 5, score: 1....\n",
"Name: scan_results, dtype: object\n"
]
}
],
"outputs": [],
"source": [
"print(df[\"scan_results\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### PDF"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Email confirmation for an event meetup\n",
"# input_file = \"/Users/sidmohan/Desktop/datafog-v2.4.0/datafog-python/tests/files/input_files/agi-builder-meetup.pdf\"\n",
"\n",
"# readthedocs for PyPDF\n",
"input_file = \"/Users/sidmohan/Desktop/datafog-v2.4.0/datafog-python/tests/files/input_files/pypdf-readthedocs-io-en-stable.pdf\"\n",
"\n",
"\n",
"# input_file = \"/Users/sidmohan/Desktop/datafog-v2.4.0/datafog-python/tests/files/input_files/pypdf-readthedocs-io-en-stable.pdf\"\n",
"output = datafog.DataFog.upload_file(uploaded_file_path=input_file)\n",
"print(output)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Multiple PDFs\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"file_dir = [\n",
" \"/Users/sidmohan/Desktop/datafog-v2.4.0/datafog-python/tests/files/input_files/agi-builder-meetup.pdf\",\n",
" \"/Users/sidmohan/Desktop/datafog-v2.4.0/datafog-python/tests/files/input_files/pypdf-readthedocs-io-en-stable.pdf\",\n",
"]\n",
"datafog = datafog.DataFog()\n",
"result = datafog.upload_files(uploaded_files=file_dir)\n",
"print(result)"
]
}
],
"metadata": {
Expand All @@ -237,7 +175,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
"version": "3.10.1"
}
},
"nbformat": 4,
Expand Down
3 changes: 2 additions & 1 deletion requirements.txt
Expand Up @@ -8,6 +8,7 @@ aiohttp==3.8.2
yarl==1.8.1
frozenlist==1.3.1
en_spacy_pii_fast

unstructured[pdf]
unstructured[pptx]


2 changes: 1 addition & 1 deletion setup.py
Expand Up @@ -6,7 +6,7 @@


def __version__():
return "2.3.2"
return "2.4.0"


project_urls = {
Expand Down
44 changes: 44 additions & 0 deletions src/datafog/__init__.py
@@ -1,10 +1,14 @@
# datafog-python/src/datafog/__init__.py
import json
import logging
import tempfile
from pathlib import Path
from typing import List

import pandas as pd
import requests
import spacy
from unstructured.partition.auto import partition

from .__about__ import __version__
from .pii_tools import PresidioEngine
Expand All @@ -29,6 +33,7 @@
nlp (spacy.lang): Spacy language model for PII detection.
"""

# Maintaining support
def __init__(self):
"""
Initialize the DataFog instance.
Expand All @@ -47,6 +52,45 @@
"""
return DataFog()

    @staticmethod
    def upload_file(uploaded_file_path):
        uploaded_file_path = Path(uploaded_file_path)
        texts = {}

        # Check for the file before reading it; read_bytes() raises
        # FileNotFoundError on a missing path.
        if not uploaded_file_path.exists():
            return "File not found."

        bytes_data = uploaded_file_path.read_bytes()
        # The temporary file is deleted when the with block exits.
        with tempfile.NamedTemporaryFile(
            delete=True, suffix=uploaded_file_path.suffix
        ) as temp_file:
            temp_file.write(bytes_data)
            temp_file.flush()  # make the written bytes visible to partition()
            elements = partition(temp_file.name)

        text = ""
        for element in elements:
            text += element.text + "\n"
        texts[uploaded_file_path.name] = text

        return texts

    @staticmethod
    def upload_files(uploaded_files: List[str]):
        """
        Process uploaded files.

        Args:
            uploaded_files (List[str]): A list of file paths uploaded by the user.

        Returns:
            Dict[str, str]: A dictionary containing the processed text for each uploaded file.
        """
        texts = {}
        for uploaded_file in uploaded_files:
            result = DataFog.upload_file(uploaded_file)
            texts.update(result)
        return texts

def __call__(self, input_source, privacy_operation):
"""
Process the input data and apply the specified privacy operation.
Expand Down
Binary file added tests/files/input_files/agi-builder-meetup.pdf
Binary file not shown.
Binary file not shown.