Commit d076499

Sid Mohan authored and committed
updated README and notebook
1 parent a122c66 commit d076499

2 files changed: +9 -191 lines changed

README.md

Lines changed: 3 additions & 82 deletions
@@ -39,11 +39,6 @@ DataFog can be installed via pip:
 pip install datafog
 ```
 
-## Examples -
-
-### v3.2.0 NEW
-
-Based on the provided test cases, here's a suitable "Getting Started" section for the documentation:
 
 ## Getting Started
 
@@ -59,7 +54,8 @@ pip install datafog
 
 ### Usage
 
-Here are some examples of how to use the DataFog library:
+The [Getting Started notebook](/datafog-python/examples/getting_started.ipynb) features a standalone Colab notebook that lets you get up and running in no time.
+
 
 #### Text PII Annotation
 
@@ -87,7 +83,7 @@ To extract text from an image and perform PII annotation, you can use the `DataF
 ```python
 from datafog import DataFog
 
-image_url = "https://example.com/image.png"
+image_url = "https://pbs.twimg.com/media/GM3-wpeWkAAP-cX.jpg"
 datafog = DataFog()
 annotated_text = await datafog.run_ocr_pipeline([image_url])
 print(annotated_text)
@@ -114,82 +110,7 @@ For more detailed usage and examples, please refer to the API documentation.
 
 Note: The DataFog library uses asynchronous programming, so make sure to use the `async`/`await` syntax when calling the appropriate methods.
 
-### v3.1.0
-
-### Base case: PII annotation of text-files
-
-```
-from datafog import OCRPIIAnnotator, TextPIIAnnotator
-import json
-import requests
-
-response = requests.get('https://gist.githubusercontent.com/sidmohan0/1aa3ec38b4e6594d3c34b113f2e0962d/raw/42e57146197be0f85a5901cd1dcdd9ad15b31bab/sotu_2023.txt')
-response.raise_for_status() # Ensure the request was successful
-text = response.text
-# print(text)
-text_annotator = TextPIIAnnotator()
-annotated_text = text_annotator.run(text, output_path=f"sotu_2023_output.json")
-print("Annotated Text:", annotated_text)
-```
-
-### OCR Reference Set (Images)
-
-```
-image_set = {
-"medical_invoice": "https://s3.amazonaws.com/thumbnails.venngage.com/template/dc377004-1c2d-49f2-8ddf-d63f11c8d9c2.png",
-"sales_receipt": "https://templates.invoicehome.com/sales-receipt-template-us-classic-white-750px.png",
-"press_release": "https://newsroom.cisco.com/c/dam/r/newsroom/en/us/assets/a/y2023/m09/cisco_splunk_1200x675_v3.png",
-"insurance_claim_scanned_form": "https://www.pdffiller.com/preview/101/35/101035394.png",
-"scanned_internal_record": "https://www.pdffiller.com/preview/435/972/435972694.png",
-"executive_email": "https://pbs.twimg.com/media/GM3-wpeWkAAP-cX.jpg"
-}
-
-```
-
-### OCR text extraction from images + PII annotation
-
-with this, you can then run the following steps:
-
-```
-from datafog import OCRPIIAnnotator, TextPIIAnnotator
-import json
-
-image_url = image_set["executive_email"]
-
-annotator = OCRPIIAnnotator()
-annotated_text = annotator.run(image_url, output_path=f"executive_email_output.json")
-print("Annotated Text:", annotated_text)
 
-```
-
-and the output should look like this:
-
-```
-Annotated Text: {'DATE_TIME': ['Wednesday', 'June 12, 2019'], 'LOC': [], 'NRP': [], 'ORG': [], 'PER': ['Kevin Scott Sent', 'Satya Nadella', 'Bill Gates Subject', 'Thoughts']}
-
-```
-
-### With PySpark
-
-Note: as of 3.1.0, you'll need to start the Spark session by instancing the DataFog class as shown below
-
-```
-from datafog import DataFog
-from datafog.pii_annotation import ImageProcessor
-datafog = DataFog()
-
-# let's process the images that we shared above
-processed_images = [(name, ImageProcessor().download_image(url=image_url)) for name, image_url in image_set.items()]
-
-from datafog.pii_annotation import SparkService
-parsed_images = [(name, ImageProcessor().parse_image(img)) for name, img in processed_images]
-
-df = SparkService().spark.createDataFrame(parsed_images, ["image_name", "parsed_data"])
-
-# Display DataFrame
-df.show(truncate=False)
-
-```
 
 ## Contributing
 

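The README note in this diff says DataFog methods are asynchronous, and the README snippet uses a bare top-level `await`, which works in a notebook cell but not in a plain Python script. A minimal sketch of wrapping such a call for a script follows. `fake_ocr_pipeline` is a hypothetical stand-in, not part of DataFog; only the `asyncio` calls are real.

```python
import asyncio


async def fake_ocr_pipeline(image_urls):
    """Hypothetical stand-in for an async pipeline call; not part of DataFog."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return [f"annotated:{url}" for url in image_urls]


async def main():
    # Inside a coroutine, `await` is valid, just as in the README snippet.
    return await fake_ocr_pipeline(["https://example.com/image.png"])


def run_pipeline():
    # asyncio.run creates an event loop, runs main() to completion, then closes the loop.
    return asyncio.run(main())


if __name__ == "__main__":
    print(run_pipeline())
```

`asyncio.run` raises `RuntimeError` when an event loop is already running, as it is inside Jupyter or Colab; that is presumably why the notebook in this commit installs `nest_asyncio`.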
examples/getting_started.ipynb

Lines changed: 6 additions & 109 deletions
@@ -23,88 +23,9 @@
 },
 {
 "cell_type": "code",
-"execution_count": 1,
+"execution_count": null,
 "metadata": {},
-"outputs": [
-{
-"name": "stdout",
-"output_type": "stream",
-"text": [
-"Collecting datafog==3.2.0b20\n",
-" Downloading datafog-3.2.0b20.tar.gz (15 kB)\n",
-" Installing build dependencies ... \u001b[?25ldone\n",
-"\u001b[?25h Getting requirements to build wheel ... \u001b[?25ldone\n",
-"\u001b[?25h Preparing metadata (pyproject.toml) ... \u001b[?25ldone\n",
-"\u001b[?25hRequirement already satisfied: pandas in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from datafog==3.2.0b20) (2.0.3)\n",
-"Requirement already satisfied: Requests==2.31.0 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from datafog==3.2.0b20) (2.31.0)\n",
-"Requirement already satisfied: spacy==3.4.4 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from datafog==3.2.0b20) (3.4.4)\n",
-"Requirement already satisfied: en-spacy-pii-fast==0.0.0 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from datafog==3.2.0b20) (0.0.0)\n",
-"Requirement already satisfied: pyspark==3.4.1 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from datafog==3.2.0b20) (3.4.1)\n",
-"Requirement already satisfied: pydantic==1.10.8 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from datafog==3.2.0b20) (1.10.8)\n",
-"Requirement already satisfied: Pillow in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from datafog==3.2.0b20) (8.4.0)\n",
-"Requirement already satisfied: sentencepiece in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from datafog==3.2.0b20) (0.2.0)\n",
-"Requirement already satisfied: protobuf in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from datafog==3.2.0b20) (4.25.3)\n",
-"Requirement already satisfied: pytesseract in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from datafog==3.2.0b20) (0.3.10)\n",
-"Requirement already satisfied: aiohttp in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from datafog==3.2.0b20) (3.9.5)\n",
-"Requirement already satisfied: pytest-asyncio in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from datafog==3.2.0b20) (0.23.6)\n",
-"Requirement already satisfied: typing-extensions>=4.2.0 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from pydantic==1.10.8->datafog==3.2.0b20) (4.11.0)\n",
-"Requirement already satisfied: py4j==0.10.9.7 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from pyspark==3.4.1->datafog==3.2.0b20) (0.10.9.7)\n",
-"Requirement already satisfied: charset-normalizer<4,>=2 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from Requests==2.31.0->datafog==3.2.0b20) (2.0.12)\n",
-"Requirement already satisfied: idna<4,>=2.5 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from Requests==2.31.0->datafog==3.2.0b20) (3.3)\n",
-"Requirement already satisfied: urllib3<3,>=1.21.1 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from Requests==2.31.0->datafog==3.2.0b20) (1.26.7)\n",
-"Requirement already satisfied: certifi>=2017.4.17 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from Requests==2.31.0->datafog==3.2.0b20) (2021.10.8)\n",
-"Requirement already satisfied: spacy-legacy<3.1.0,>=3.0.10 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from spacy==3.4.4->datafog==3.2.0b20) (3.0.12)\n",
-"Requirement already satisfied: spacy-loggers<2.0.0,>=1.0.0 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from spacy==3.4.4->datafog==3.2.0b20) (1.0.5)\n",
-"Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from spacy==3.4.4->datafog==3.2.0b20) (1.0.10)\n",
-"Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from spacy==3.4.4->datafog==3.2.0b20) (2.0.8)\n",
-"Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from spacy==3.4.4->datafog==3.2.0b20) (3.0.9)\n",
-"Requirement already satisfied: thinc<8.2.0,>=8.1.0 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from spacy==3.4.4->datafog==3.2.0b20) (8.1.12)\n",
-"Requirement already satisfied: wasabi<1.1.0,>=0.9.1 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from spacy==3.4.4->datafog==3.2.0b20) (0.10.1)\n",
-"Requirement already satisfied: srsly<3.0.0,>=2.4.3 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from spacy==3.4.4->datafog==3.2.0b20) (2.4.8)\n",
-"Requirement already satisfied: catalogue<2.1.0,>=2.0.6 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from spacy==3.4.4->datafog==3.2.0b20) (2.0.10)\n",
-"Requirement already satisfied: typer<0.8.0,>=0.3.0 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from spacy==3.4.4->datafog==3.2.0b20) (0.7.0)\n",
-"Requirement already satisfied: pathy>=0.3.5 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from spacy==3.4.4->datafog==3.2.0b20) (0.11.0)\n",
-"Requirement already satisfied: smart-open<7.0.0,>=5.2.1 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from spacy==3.4.4->datafog==3.2.0b20) (5.2.1)\n",
-"Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from spacy==3.4.4->datafog==3.2.0b20) (4.66.4)\n",
-"Requirement already satisfied: numpy>=1.15.0 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from spacy==3.4.4->datafog==3.2.0b20) (1.25.2)\n",
-"Requirement already satisfied: jinja2 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from spacy==3.4.4->datafog==3.2.0b20) (3.1.3)\n",
-"Requirement already satisfied: setuptools in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from spacy==3.4.4->datafog==3.2.0b20) (69.1.1)\n",
-"Requirement already satisfied: packaging>=20.0 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from spacy==3.4.4->datafog==3.2.0b20) (23.2)\n",
-"Requirement already satisfied: langcodes<4.0.0,>=3.2.0 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from spacy==3.4.4->datafog==3.2.0b20) (3.3.0)\n",
-"Requirement already satisfied: aiosignal>=1.1.2 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from aiohttp->datafog==3.2.0b20) (1.3.1)\n",
-"Requirement already satisfied: attrs>=17.3.0 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from aiohttp->datafog==3.2.0b20) (22.2.0)\n",
-"Requirement already satisfied: frozenlist>=1.1.1 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from aiohttp->datafog==3.2.0b20) (1.3.3)\n",
-"Requirement already satisfied: multidict<7.0,>=4.5 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from aiohttp->datafog==3.2.0b20) (6.0.4)\n",
-"Requirement already satisfied: yarl<2.0,>=1.0 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from aiohttp->datafog==3.2.0b20) (1.8.2)\n",
-"Requirement already satisfied: python-dateutil>=2.8.2 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from pandas->datafog==3.2.0b20) (2.8.2)\n",
-"Requirement already satisfied: pytz>=2020.1 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from pandas->datafog==3.2.0b20) (2021.3)\n",
-"Requirement already satisfied: tzdata>=2022.1 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from pandas->datafog==3.2.0b20) (2024.1)\n",
-"Requirement already satisfied: pytest<9,>=7.0.0 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from pytest-asyncio->datafog==3.2.0b20) (7.4.4)\n",
-"Requirement already satisfied: pathlib-abc==0.1.1 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from pathy>=0.3.5->spacy==3.4.4->datafog==3.2.0b20) (0.1.1)\n",
-"Requirement already satisfied: iniconfig in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from pytest<9,>=7.0.0->pytest-asyncio->datafog==3.2.0b20) (2.0.0)\n",
-"Requirement already satisfied: pluggy<2.0,>=0.12 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from pytest<9,>=7.0.0->pytest-asyncio->datafog==3.2.0b20) (1.4.0)\n",
-"Requirement already satisfied: six>=1.5 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from python-dateutil>=2.8.2->pandas->datafog==3.2.0b20) (1.16.0)\n",
-"Requirement already satisfied: blis<0.8.0,>=0.7.8 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from thinc<8.2.0,>=8.1.0->spacy==3.4.4->datafog==3.2.0b20) (0.7.11)\n",
-"Requirement already satisfied: confection<1.0.0,>=0.0.1 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from thinc<8.2.0,>=8.1.0->spacy==3.4.4->datafog==3.2.0b20) (0.1.4)\n",
-"Requirement already satisfied: click<9.0.0,>=7.1.1 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from typer<0.8.0,>=0.3.0->spacy==3.4.4->datafog==3.2.0b20) (8.1.7)\n",
-"Requirement already satisfied: MarkupSafe>=2.0 in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (from jinja2->spacy==3.4.4->datafog==3.2.0b20) (2.1.5)\n",
-"Building wheels for collected packages: datafog\n",
-" Building wheel for datafog (pyproject.toml) ... \u001b[?25ldone\n",
-"\u001b[?25h Created wheel for datafog: filename=datafog-3.2.0b20-py3-none-any.whl size=16437 sha256=85772be41af732abed8ff3306701762a3d9df24129bf2a6392ea7accf0f99467\n",
-" Stored in directory: /Users/sidmohan/Library/Caches/pip/wheels/e3/1d/bb/ac5c7ef27ba420864a19f0c53491bd68324cbb71082b15b3e4\n",
-"Successfully built datafog\n",
-"Installing collected packages: datafog\n",
-" Attempting uninstall: datafog\n",
-" Found existing installation: datafog 3.2.0b12\n",
-" Uninstalling datafog-3.2.0b12:\n",
-" Successfully uninstalled datafog-3.2.0b12\n",
-"Successfully installed datafog-3.2.0b20\n",
-"\n",
-"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.2.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m24.0\u001b[0m\n",
-"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n"
-]
-}
-],
+"outputs": [],
 "source": [
 "!pip install \"datafog==3.2.0\""
 ]
@@ -118,43 +39,19 @@
 },
 {
 "cell_type": "code",
-"execution_count": 2,
+"execution_count": null,
 "metadata": {},
-"outputs": [
-{
-"name": "stdout",
-"output_type": "stream",
-"text": [
-"The operation couldn’t be completed. Unable to locate a Java Runtime that supports apt.\n",
-"Please visit http://www.java.com for information on installing Java.\n",
-"\n",
-"The operation couldn’t be completed. Unable to locate a Java Runtime that supports apt.\n",
-"Please visit http://www.java.com for information on installing Java.\n",
-"\n"
-]
-}
-],
+"outputs": [],
 "source": [
 "! apt install tesseract-ocr\n",
 "! apt install libtesseract-dev"
 ]
 },
 {
 "cell_type": "code",
-"execution_count": 3,
+"execution_count": null,
 "metadata": {},
-"outputs": [
-{
-"name": "stdout",
-"output_type": "stream",
-"text": [
-"Requirement already satisfied: nest_asyncio in /Users/sidmohan/.pyenv/versions/3.11.7/envs/2.2.0b1/lib/python3.11/site-packages (1.6.0)\n",
-"\n",
-"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.2.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m24.0\u001b[0m\n",
-"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n"
-]
-}
-],
+"outputs": [],
 "source": [
 "!pip install nest_asyncio"
 ]

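The notebook changes in this commit amount to resetting each code cell's `execution_count` to `null` and emptying its `outputs`. A minimal sketch of that transformation using only the standard library follows. It assumes the usual nbformat 4 layout (a top-level `cells` list); the `clear_outputs` helper and the sample notebook are illustrative and not taken from the repository.

```python
import json


def clear_outputs(notebook: dict) -> dict:
    """Reset execution counts and drop outputs for every code cell (nbformat 4 layout assumed)."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["execution_count"] = None  # serialized as null in the .ipynb JSON
            cell["outputs"] = []
    return notebook


# A tiny in-memory notebook standing in for a real .ipynb file.
sample = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {},
            "outputs": [{"name": "stdout", "output_type": "stream", "text": ["ok\n"]}],
            "source": ["!pip install nest_asyncio"],
        },
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

cleaned = clear_outputs(sample)
print(json.dumps(cleaned["cells"][0], indent=1))
```

Clearing outputs before committing keeps machine-specific paths and install logs, like the ones removed above, out of the repository history.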