README.md (20 additions, 24 deletions)
@@ -20,25 +20,25 @@

## Overview

### What is DataFog?

DataFog is an open-source DevSecOps platform that lets you scan for and redact Personally Identifiable Information (PII) in your Generative AI applications.
### What problem are we solving?

**Context**

The primary use case today is Retrieval Augmented Generation (RAG) systems. As a refresher, RAG systems retrieve information from a custom knowledge base, constructed by you or your team, and use that information in responses, either by citing the source files directly or by drawing on them implicitly. This knowledge base is assembled through a deliberate process: files are uploaded into a workflow, segmented into logical information blocks, and tagged according to their contextual significance. There are a thousand ways to add nuance to this characterization, but it suffices for the vast majority of cases!

**Problem**

How do you keep:
- Customer PII
- Employee PII
- Sensitive company information pertaining to org changes or restructurings
- Pending M&A activity
- Conversations with external counsel on material corporate matters (e.g., a product recall)
- and more

from entering a Generative AI environment in the first place? What you need is a tool to scan and redact your RAG-bound documents based on your organization's or team's needs.

@@ -47,39 +47,37 @@ That's where DataFog comes in. Our solution to this problem is through two major

**PII Observability** Takes in your batch or streaming data and returns a scan indicating character-level detection of entities.

**Privacy Filter** DataFog can slot in as a pre-processor that redacts PII from your files before they get uploaded to a RAG database.

You can import this SDK into a Python environment (such as a Google Colab notebook; check out our [Getting Started](examples/getting-started.ipynb) notebook) and be up and running within a few lines of code.
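To make the two features above concrete, here is a minimal, self-contained sketch in plain Python. It does not use DataFog's API: the `Detection` class, the `scan` and `redact` functions, and the two regex patterns are hypothetical stand-ins, and a real detector would cover far more entity types than email addresses and US-style phone numbers.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns for illustration only; real PII detection needs
# far broader coverage (names, addresses, IDs) than two regexes.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


@dataclass(frozen=True)
class Detection:
    entity_type: str
    start: int  # index of the first matched character
    end: int    # index one past the last matched character
    text: str


def scan(document: str) -> list[Detection]:
    """Return character-level detections, sorted by start offset."""
    found = [
        Detection(label, m.start(), m.end(), m.group())
        for label, pattern in PATTERNS.items()
        for m in pattern.finditer(document)
    ]
    return sorted(found, key=lambda d: d.start)


def redact(document: str) -> str:
    """Replace each detected span with a placeholder such as [EMAIL]."""
    pieces, cursor = [], 0
    for d in scan(document):
        if d.start < cursor:
            continue  # skip a detection that overlaps one already redacted
        pieces.append(document[cursor:d.start])
        pieces.append(f"[{d.entity_type}]")
        cursor = d.end
    pieces.append(document[cursor:])
    return "".join(pieces)


chunk = "Reach Jeff at jeff@example.com or 555-867-5309 before Friday."
print(scan(chunk))    # observability: which entities, at which offsets
print(redact(chunk))  # privacy filter: text that is safe to index
```

In this sketch, `scan` plays the role of the observability step (what was found and where), while `redact` is the pre-processing step you would run on each chunk before it reaches a RAG database.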
### There are lots of PII tools out there; why DataFog?

If you look at the landscape of PII detection tools, their very existence was in many cases driven by regulatory requirements (e.g., "comply with CCPA/GDPR/HIPAA"). In this scenario, there's a well-defined problem, a specific set of immutable entities to look for, and a relatively static universe of document schemas to work with. As a result, the products are purpose-built for the problem they are solving.

However, Generative AI changes how we think about privacy. There's now a changing set of privacy requirements (new M&A deals or internal discussions mean new terms to scan and redact) as well as different and varying document sources to contend with. PII detection is no longer just about compliance; it's an active (and for some, new) internal security threat for CISOs and Eng Leaders to contend with. We want DataFog to be built and driven to meet the needs of the open-source community as they tackle this challenge.
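As a rough illustration of what "new terms to scan and redact" can mean in practice, the sketch below redacts a team-maintained list of sensitive terms that changes over time. It is plain Python and not DataFog's API; the function names, the placeholder, and the example terms are all hypothetical.

```python
import re


def build_term_pattern(terms: list[str]) -> re.Pattern:
    """Compile one case-insensitive pattern matching any listed term."""
    # Longest terms first so "Project Falcon" wins over a shorter "Falcon".
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(t) for t in ordered)
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)


def redact_terms(text: str, terms: list[str], placeholder: str = "[REDACTED]") -> str:
    """Replace every occurrence of a sensitive term with the placeholder."""
    if not terms:
        return text  # an empty alternation would match at every word boundary
    return build_term_pattern(terms).sub(placeholder, text)


# Hypothetical, team-maintained list; it grows as new deals or reorgs appear.
sensitive_terms = ["Project Falcon", "Acme Corp"]
memo = "Project Falcon closes in Q3, pending Acme Corp board approval."
print(redact_terms(memo, sensitive_terms))
```

Because the list is ordinary data rather than a fixed entity schema, it can be updated whenever a new deal or internal discussion introduces terms that should not reach a Generative AI environment.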
## Installation

DataFog can be installed via pip:

```bash
pip install datafog
```

and in your Python environment:

```python
from datafog import PresidioEngine as presidio
```
## Examples

Here are some examples of DataFog being used to redact information in business contexts. See `/examples` for our [Getting Started](examples/getting-started.ipynb) notebook. We'll be regularly updating content and providing comprehensive guides to using DataFog in production contexts. If you have ideas for a tutorial or guide that you would like to see, please let us know!

```python
ceo_email_chunk = "I'm announcing on Friday that Jeff is going to be CTO."
```

@@ -94,18 +92,17 @@ Here are some examples of datafog being used to redact information in business c
## Contributing

DataFog is a community-driven **open-source** platform, and we've been fortunate to have a small and growing contributor base. We'd love to hear your ideas, feedback, and suggestions for improvement: anything on your mind about how to make DataFog better! Join our [Discord](https://discord.gg/bzDth394R4) and become part of our growing community.

### Dev Notes

- Justfile commands:
  - `just format` to apply formatting.
  - `just lint` to check formatting and style.

### Testing

To run the datafog unit tests, check out this repository and do