Commit 5ded064

Sid Mohan authored and committed
pre-commit passed
1 parent 661c20b commit 5ded064

3 files changed: +38 -92 lines changed

.flake8

+2 -1
@@ -2,4 +2,5 @@
 ignore = E203, E266, E501, W503, B006, B007, B008, F401, C416, B950, B904
 max-line-length = 88
 max-complexity = 18
-select = B,C,E,F,W,T4,B9
+select = B,C,E,F,W,T4,B9
+exclude = venv, .venv, tests/.datafog_env, examples/venv
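To see what the new `exclude` line covers, the hunk can be read back with Python's standard `configparser`. This is a minimal sketch, not how flake8 itself loads its configuration. It assumes the first line of `.flake8` (outside the hunk) is the usual `[flake8]` section header, and `read_list_option` is a hypothetical helper name.

```python
import configparser

# Assumption: the "[flake8]" header is line 1 of the file and is not shown in the hunk.
FLAKE8_TEXT = """\
[flake8]
ignore = E203, E266, E501, W503, B006, B007, B008, F401, C416, B950, B904
max-line-length = 88
max-complexity = 18
select = B,C,E,F,W,T4,B9
exclude = venv, .venv, tests/.datafog_env, examples/venv
"""


def read_list_option(config_text: str, option: str) -> list[str]:
    """Return a comma-separated option from the [flake8] section as a list."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    raw = parser.get("flake8", option, fallback="")
    return [item.strip() for item in raw.split(",") if item.strip()]


excluded = read_list_option(FLAKE8_TEXT, "exclude")
print(excluded)
```

Each entry is a path pattern that the linter is told to skip, so the virtual environments listed here no longer contribute lint findings.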

README.md

+35 -91
@@ -19,18 +19,8 @@
 
 ## Overview
 
-### What is DataFog?
-
 DataFog is an open-source DevSecOps platform that lets you scan and redact Personally Identifiable Information (PII) out of your Generative AI applications.
 
-### Core Problem
-
-![image](https://github.com/DataFog/datafog-python/assets/61345237/57fba4e5-21cc-458f-ac6a-6fbbb70a8de1)
-
-### How it works
-
-![image](https://github.com/DataFog/datafog-python/assets/61345237/91f4634a-8a9f-4621-81bc-09930feda78a)
-
 ## Installation
 
 DataFog can be installed via pip:
@@ -41,112 +31,66 @@ pip install datafog
 
 ## Getting Started
 
-The DataFog library provides functionality for text and image processing, including PII (Personally Identifiable Information) annotation and OCR (Optical Character Recognition) capabilities.
-
-### Installation
-
-To install the DataFog library, use the following command:
-
-```
-pip install datafog
-```
-
-### Usage
+To use DataFog, you'll need to create a DataFog client with the desired operations. Here's a basic setup:
 
-The [Getting Started notebook](/examples/getting_started.ipynb) features a standalone Colab notebook.
+```python
+from datafog import DataFog
 
-#### Text PII Annotation
-
-To annotate PII in a given text, lets start with a set of clinical notes:
+# For text annotation
+client = DataFog(operations="annotate_pii")
 
+# For OCR (Optical Character Recognition)
+ocr_client = DataFog(operations="extract_text")
 ```
-!git clone https://gist.github.com/b43b72693226422bac5f083c941ecfdb.git
-# Define the directory path
-folder_path = 'clinical_notes/'
 
-# List all files in the directory
-file_list = os.listdir(folder_path)
-text_files = sorted([file for file in file_list if file.endswith('.txt')])
+### Text PII Annotation
 
-with open(os.path.join(folder_path, text_files[0]), 'r') as file:
-    clinical_note = file.read()
+Here's an example of how to annotate PII in a text document:
 
-display(Markdown(clinical_note))
-```
+```python
+import requests
 
-which looks like this:
+# Fetch sample medical record
+doc_url = "https://gist.githubusercontent.com/sidmohan0/b43b72693226422bac5f083c941ecfdb/raw/b819affb51796204d59987893f89dee18428ed5d/note1.txt"
+response = requests.get(doc_url)
+text_lines = [line for line in response.text.splitlines() if line.strip()]
 
+# Run annotation
+annotations = client.run_text_pipeline_sync(str_list=text_lines)
+print(annotations)
 ```
 
-**Date:** April 10, 2024
-
-**Patient:** Emily Johnson, 35 years old
-
-**MRN:** 00987654
-
-**Chief Complaint:** "I've been experiencing severe back pain and numbness in my legs."
-
-**History of Present Illness:** The patient is a 35-year-old who presents with a 2-month history of worsening back pain, numbness in both legs, and occasional tingling sensations. The patient reports working as a freelance writer and has been experiencing increased stress due to tight deadlines and financial struggles.
+### OCR PII Annotation
 
-**Past Medical History:** Hypothyroidism
+For OCR capabilities, you can use the following:
 
-**Social History:**
-The patient shares a small apartment with two roommates and relies on public transportation. They mention feeling overwhelmed with work and personal responsibilities, often sacrificing sleep to meet deadlines. The patient expresses concern over the high cost of healthcare and the need for affordable medication options.
+```python
+import asyncio
+import nest_asyncio
 
-**Review of Systems:** Denies fever, chest pain, or shortness of breath. Reports occasional headaches.
+nest_asyncio.apply()
 
-**Physical Examination:**
-- General: Appears tired but is alert and oriented.
-- Vitals: BP 128/80, HR 72, Temp 98.6°F, Resp 14/min
 
-**Assessment/Plan:**
-- Continue to monitor blood pressure and thyroid function.
-- Discuss affordable medication options with a pharmacist.
-- Refer to a social worker to address housing concerns and access to healthcare services.
-- Encourage the patient to engage with community support groups for social support.
-- Schedule a follow-up appointment in 4 weeks or sooner if symptoms worsen.
-
-**Comments:** The patient's health concerns are compounded by socioeconomic factors, including employment status, housing stability, and access to healthcare. Addressing these social determinants of health is crucial for improving the patient's overall well-being.
-
-```
-
-we can then set up our pipeline to accept these files
-
-```
-async def run_text_pipeline_demo():
-    results = await datafog.run_text_pipeline(texts)
-    print("Text Pipeline Results:", results)
-    return results
+async def run_ocr_pipeline_demo():
+    image_url = "https://s3.amazonaws.com/thumbnails.venngage.com/template/dc377004-1c2d-49f2-8ddf-d63f11c8d9c2.png"
+    results = await ocr_client.run_ocr_pipeline(image_urls=[image_url])
+    print("OCR Pipeline Results:", results)
 
 
-texts = [clinical_note]
 loop = asyncio.get_event_loop()
-results = loop.run_until_complete(run_text_pipeline_demo())
-```
-
-Note: The DataFog library uses asynchronous programming, so make sure to use the `async`/`await` syntax when calling the appropriate methods.
-
-#### OCR PII Annotation
-
-Let's use a image (which could easily be a converted or scanned PDF)
-
-![Executive Email](https://pbs.twimg.com/media/GM3-wpeWkAAP-cX.jpg)
-
+loop.run_until_complete(run_ocr_pipeline_demo())
 ```
-datafog = DataFog(operations='extract_text')
-url_list = ['https://pbs.twimg.com/media/GM3-wpeWkAAP-cX.jpg']
 
-async def run_ocr_pipeline_demo():
-    results = await datafog.run_ocr_pipeline(url_list)
-    print("OCR Pipeline Results:", results)
+Note: The DataFog library uses asynchronous programming for OCR, so make sure to use the `async`/`await` syntax when calling the appropriate methods.
 
-loop = asyncio.get_event_loop()
-loop.run_until_complete(run_ocr_pipeline_demo())
+## Examples
 
-```
+For more detailed examples, check out our Jupyter notebooks in the `examples/` directory:
 
-You'll notice that we use async functions liberally throughout the SDK - given the nature of the functions we're providing and the extension of DataFog into API/other formats, this allows the functions to be more easily adapted for those uses.
+- `text_annotation_example.ipynb`: Demonstrates text PII annotation
+- `image_processing.ipynb`: Shows OCR capabilities and text extraction from images
 
+These notebooks provide step-by-step guides on how to use DataFog for various tasks.
 
 ### Dev Notes
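The OCR snippet added to the README drives its coroutine through `asyncio.get_event_loop()` plus `nest_asyncio`, a combination that suits notebooks where an event loop is already running. In a plain script, `asyncio.run` is the usual entry point. The sketch below shows that pattern under stated assumptions: `fake_ocr_pipeline` is a hypothetical stand-in for `ocr_client.run_ocr_pipeline`, and the URL is a placeholder, so nothing here calls DataFog or the network.

```python
import asyncio


async def fake_ocr_pipeline(image_urls: list[str]) -> list[str]:
    """Hypothetical stand-in for ocr_client.run_ocr_pipeline(image_urls=...)."""
    await asyncio.sleep(0)  # yield to the event loop, as a real I/O call would
    return [f"text extracted from {url}" for url in image_urls]


async def main() -> list[str]:
    image_url = "https://example.com/sample.png"  # placeholder, not a real asset
    return await fake_ocr_pipeline(image_urls=[image_url])


# asyncio.run creates a fresh event loop, runs main() to completion, and closes the loop.
results = asyncio.run(main())
print("OCR Pipeline Results:", results)
```

Swapping the stand-in for the real client call should leave the surrounding structure unchanged, provided no other event loop is already running in the process.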

requirements-dev.txt

+1
@@ -5,6 +5,7 @@
 just
 isort
 black
+blacken-docs
 flake8
 tox
 pytest
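`blacken-docs` applies Black formatting to Python code blocks embedded in documentation files, which fits the `python`-tagged fences this commit adds to README.md. The sketch below only illustrates which blocks in a Markdown file carry that tag. It is not how blacken-docs is implemented, and `python_blocks` and the sample text are hypothetical.

```python
import re

FENCE = "`" * 3  # built programmatically so the sample nests cleanly inside Markdown

# Match a fence opened with the "python" info string up to its closing fence.
PYTHON_BLOCK = re.compile(
    rf"^{FENCE}python\n(.*?)^{FENCE}\s*$",
    re.DOTALL | re.MULTILINE,
)


def python_blocks(markdown: str) -> list[str]:
    """Return the body of every fenced block tagged as python."""
    return [match.group(1) for match in PYTHON_BLOCK.finditer(markdown)]


sample = (
    "Intro text\n"
    f"{FENCE}python\n"
    "from datafog import DataFog\n"
    'client = DataFog(operations="annotate_pii")\n'
    f"{FENCE}\n"
    "Closing text\n"
)
blocks = python_blocks(sample)
print(blocks)
```

Untagged fences, like the ones the old README used, are not matched by this pattern.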
