Commit 5c5cb88

Merge pull request #14 from DataFog/v2.3.2
updated README.MD
2 parents 733a602 + 6314014 commit 5c5cb88

4 files changed, +81 -70 lines changed

Diff for: .gitignore

+1 -1
@@ -13,7 +13,7 @@ build/
 /src/datafog/pii_tools/__pycache__/
 /tests/__pycache__/
 /tests/scratch.py
-node_modules
+node_modules/
 datafog_debug.log
 sotu_2023.txt
 .DS_Store

Diff for: README.md

+61 -44
@@ -3,88 +3,105 @@
 </p>

 <p align="center">
-<b>Open-source PII Detection for Retrieval Systems</b>. <br />
-Scan, redact, and manage PII in your documents before they get uploaded to a Generative AI system.
+<b>Open-source DevSecOps for Generative AI Systems</b>. <br />
 </p>

 <p align="center">
 <a href="https://pypi.org/project/datafog/"><img src="https://img.shields.io/pypi/v/datafog.svg?style=flat-square" alt="PyPi Version"></a>
 <a href="https://pypi.org/project/datafog/"><img src="https://img.shields.io/pypi/pyversions/datafog.svg?style=flat-square" alt="PyPI pyversions"></a>
 <a href="https://github.com/datafog/datafog-python"><img src="https://img.shields.io/github/stars/datafog/datafog-python.svg?style=flat-square&logo=github&label=Stars&logoColor=white" alt="GitHub stars"></a>
 <a href="https://pypistats.org/packages/datafog"><img src="https://img.shields.io/pypi/dm/datafog.svg?style=flat-square" alt="PyPi downloads"></a>
-</p>
-
-<p align="center">
+<a href="https://discord.gg/bzDth394R4"><img src="https://img.shields.io/discord/1173803135341449227?style=flat" alt="Discord"></a>
 <a href="https://github.com/psf/black"><img src="https://img.shields.io/badge/code%20style-black-000000.svg?style=flat-square" alt="Code style: black"></a>
-</p>
-
-<p align="center">
 <a href="https://codecov.io/gh/datafog/datafog-python"><img src="https://img.shields.io/codecov/c/github/datafog/datafog-python.svg?style=flat-square" alt="codecov"></a>
+<a href="https://github.com/datafog/datafog-python/issues"><img src="https://img.shields.io/github/issues/datafog/datafog-python.svg?style=flat-square" alt="GitHub Issues"></a>
 </p>

 ## Overview

-DataFog works by scanning and redacting-out PII in files **before** they get uploaded to a Generative AI system.
+### What is DataFog?
+
+DataFog is an open-source DevSecOps platform that lets you scan and redact Personally Identifiable Information (PII) out of your Generative AI applications.
+
+### What problem are we solving?
+
+**Context**
+
+The primary use case today is Retrieval Augmented Generation (RAG) systems. As a refresher, RAG systems operate by retrieving information from a custom knowledge base—constructed by you or your team—and leverage this information, either by directly citing the files in a response or inferred through the model's responses. This knowledge base is assembled through a deliberate process, which involves uploading files into a workflow. These files are then segmented into logical information blocks and tagged according to their contextual significance. There are a thousand ways to add nuance to this characterization, but this suffices for the vast majority of cases!
+
+**Problem**
+
+How do you keep:
+
+- Customer PII
+- Employee PII
+- Sensitive company information pertaining to org changes or restructurings
+- Pending M&A activity
+- Conversations with external counsel on material corporate matters (i.e. product recall, etc)
+- and more
+
+from entering a Generative AI environment in the first place? What you need is a tool to scan and redact your RAG-bound documents based on your organization or team needs.

-## How it works
+That's where DataFog comes in. Our solution to this problem is through two major approaches:

-<img src="https://www.datafog.ai/hero.png" alt="DataFog Overview">
+**PII Observability** Take in your batch/streaming data and return a scan indicating character-level detection of entities
+**Privacy Filter** DataFog can slot in as a pre-processor that redacts PII from your files before they get uploaded to a RAG database
+
+With this SDK, you can import it into a Python environment (like a Google Colab notebook, check out our [Getting Started](examples/getting-started.ipynb)) and within a few lines of code you're up and running.
+
+### How it works
+
+<img src="https://www.datafog.ai/hero.png" alt="DataFog Overview" style="width:50%;">
+
+### There's lots of PII tools out there; why DataFog?
+
+If you look at the landscape of PII detection tools, their very existence was in many cases driven by regulatory requirements (i.e. 'comply with CCPA/GDPR/HIPAA').
+In this scenario, there's a very defined problem, a specific set of immutable entities to look for, and a relatively static universe of document schema to work with. What that means as an end-result is that the products
+are purpose-built for the problem that they are solving.
+
+However, Generative AI changes how we think about privacy. There's now a changing set of privacy requirements (new M&A deals, internal discussions means new terms to scan/redact) as well as different and varying document sources to contend with. PII detection is no longer just about compliance, it's an active - and for some, new - internal security threat for CISOs and Eng Leaders to contend with. We want DataFog to be built and driven to meet the needs of the open-source community as they tackle this challenge.

 ## Installation

 DataFog can be installed via pip:

 ```bash
-pip install datafog # python client
+pip install datafog
 ```

-## Usage
-
-We're going to build up functionality starting with support for the Microsoft Presidio library. If you have any custom requests that would be of benefit to the community, please let us know!
+and in your python environment:

 ```
-import requests
-from datafog import PresidioEngine as presidio
-
-# Example: Detecting PII in a String
-pii_detected = presidio.scan("My name is John Doe and my email is [email protected]")
-print("PII Detected:", pii_detected)
-
-# Example: Detecting PII in a File
-sample_filepath = "/Users/sidmohan/Desktop/v2.0.0/datafog-python/tests/files/input_files/sample.csv"
-with open(sample_filepath, "r") as f:
-    original_value = f.read()
-pii_detected = presidio.scan(original_value)
-print("PII Detected in File:", pii_detected)
-
-# Example: Detecting PII in a URL
-sample_url = "https://gist.githubusercontent.com/sidmohan0/1aa3ec38b4e6594d3c34b113f2e0962d/raw/42e57146197be0f85a5901cd1dcdd9ad15b31bab/sotu_2023.txt"
-response = requests.get(sample_url)
-original_value = response.text
-pii_detected = presidio.scan(original_value)
-print("PII Detected in URL Content:", pii_detected)
-
+from datafog import PresidioEngine as presidio
 ```

-Depending on your input, the output will be a list of detected PII entities:
+## Examples
+
+Here are some examples of datafog being used to redact information in business contexts. Please see '/examples' for our [Getting Started](examples/getting-started.ipynb) notebook. We'll be regularly updating content and providing comprehensive guides to using DataFog in production contexts. If you have any ideas for a tutorial or guide that you would like to see, please let us know!

 ```
-PII Detected: [type: EMAIL_ADDRESS, start: 36, end: 53, score: 1.0, type: PERSON, start: 11, end: 19, score: 0.85, type: URL, start: 44, end: 53, score: 0.5]
+ceo_email_chunk = "I'm announcing on Friday that Jeff is going to be CTO."
+
+scan_results1 = presidio.scan(ceo_email_chunk)
+print("PII Detected - base case:", scan_results1)
+# PII Detected - base case: [type: PERSON, start: 30, end: 34, score: 0.85]
+
+
+scan_results2 = presidio.scan(ceo_email_chunk, deny_list=['CTO'])
+print("PII Detected with deny list:", scan_results2)
+# PII Detected with deny list: [type: CUSTOM_PII, start: 50, end: 53, score: 1.0, type: PERSON, start: 30, end: 34, score: 0.85]
+
 ```

 ## Contributing

-This is an open-source project and we welcome contributions. If you have any questions, please feel free to reach out to us, join our Discord or email me directly at [email protected].
+DataFog is a community-driven **open-source** platform and we've been fortunate to have a small and growing contributor base. We'd love to hear ideas, feedback, suggestions for improvement - anything on your mind about what you think can be done to make DataFog better! Join our [Discord](https://discord.gg/bzDth394R4) and join our growing community.

 ### Dev Notes

-- Clone repo
-- Run 'poetry install' to install dependencies (recommend entering poetry shell for preserving dependencies)
 - Justfile commands:
 - `just format` to apply formatting.
 - `just lint` to check formatting and style.
-- `just tag` to tag your project on git
-- `just upload` to publish to PyPi.

 ### Testing

@@ -96,7 +113,7 @@ tox

 ```

-### License
+## License

 This software is published under the [MIT
 license](https://en.wikipedia.org/wiki/MIT_License).
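
A note on the README example in this diff: the printed scan results report each detected entity as a type with `start` and `end` character offsets. The "Privacy Filter" step the README describes could use offsets like these to redact text. The following is a minimal sketch of that idea, not DataFog's implementation. The `redact` helper and the `(type, start, end)` tuple shape are hypothetical, and the spans are copied by hand from the printed output rather than returned by `presidio.scan`.

```python
from typing import List, Tuple

# Hypothetical span shape: (entity type, start offset, end offset).
# This is NOT the return type of presidio.scan; it only mirrors the
# fields shown in the printed output of the README example.
Span = Tuple[str, int, int]


def redact(text: str, spans: List[Span]) -> str:
    """Replace each [start, end) span with a <TYPE> placeholder."""
    # Process spans from right to left so that replacing one span
    # does not shift the offsets of spans that come earlier in the text.
    for entity_type, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        text = text[:start] + f"<{entity_type}>" + text[end:]
    return text


ceo_email_chunk = "I'm announcing on Friday that Jeff is going to be CTO."

# Offsets copied from the "with deny list" output in the README example.
spans = [("CUSTOM_PII", 50, 53), ("PERSON", 30, 34)]

redacted = redact(ceo_email_chunk, spans)
print(redacted)
# I'm announcing on Friday that <PERSON> is going to be <CUSTOM_PII>.
```

The sketch assumes the spans do not overlap; overlapping detections would need to be merged or prioritized before redaction.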

Diff for: examples/getting-started.ipynb

+19 -25
@@ -4,7 +4,14 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"# <img src=\"../public/colorlogo.png\" width=\"20%\"/>\n"
+"<img src=\"../public/colorlogo.png\" width=\"50%\"/>\n",
+"\n",
+"\n",
+"[Homepage](https://www.datafog.ai) | \n",
+"[Discord](https://discord.gg/bzDth394R4) | \n",
+"[Github](https://github.com/datafog/datafog-python) | \n",
+"[Contact](mailto:[email protected])\n",
+"\n"
 ]
 },
 {
@@ -20,10 +27,15 @@
 "source": [
 "Welcome to the DataFog Python SDK! This notebook will walk you through the basics of using the SDK to scan text for PII.\n",
 "\n",
-"Please consider this a living document - we will be focused on adding new content and guides to make using DataFog as easy as possible! If you have any questions or need help, please reach out to us on our Discord or email us at [email protected].\n",
+"Please consider this a living document - we will be focused on adding new content and guides to make using DataFog as easy as possible! If you have any questions or need help, please reach out to us on our [Discord](https://discord.gg/bzDth394R4) or email us at [email protected].\n",
 "\n"
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": []
+},
 {
 "cell_type": "markdown",
 "metadata": {},
@@ -42,7 +54,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 80,
+"execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
@@ -67,18 +79,9 @@
 },
 {
 "cell_type": "code",
-"execution_count": 81,
+"execution_count": null,
 "metadata": {},
-"outputs": [
-{
-"name": "stdout",
-"output_type": "stream",
-"text": [
-" text\n",
-"0 Mr. Speaker, Madam Vice President, our First L...\n"
-]
-}
-],
+"outputs": [],
 "source": [
 "# Download the transcript of the US State of the Union 2023\n",
 "url = \"https://bit.ly/datafog-sample-text-sotu-2023\"\n",
@@ -113,18 +116,9 @@
 },
 {
 "cell_type": "code",
-"execution_count": 82,
+"execution_count": null,
 "metadata": {},
-"outputs": [
-{
-"name": "stdout",
-"output_type": "stream",
-"text": [
-"PII Detected - base case: [type: PERSON, start: 30, end: 34, score: 0.85]\n",
-"PII Detected with deny list: [type: CUSTOM_PII, start: 50, end: 53, score: 1.0, type: PERSON, start: 30, end: 34, score: 0.85]\n"
-]
-}
-],
+"outputs": [],
 "source": [
 "ceo_email_chunk = \"I'm announcing on Friday that Jeff is going to be CTO.\"\n",
 "\n",

Diff for: public/rag-chat.png

1.11 MB
