|
3 | 3 | </p>
|
4 | 4 |
|
5 | 5 | <p align="center">
|
6 |
| - <b>Open-source PII Detection for Retrieval Systems</b>. <br /> |
7 |
| - Scan, redact, and manage PII in your documents before they get uploaded to a Generative AI system. |
| 6 | + <b>Open-source DevSecOps for Generative AI Systems</b>. <br /> |
8 | 7 | </p>
|
9 | 8 |
|
10 | 9 | <p align="center">
|
11 | 10 | <a href="https://pypi.org/project/datafog/"><img src="https://img.shields.io/pypi/v/datafog.svg?style=flat-square" alt="PyPi Version"></a>
|
12 | 11 | <a href="https://pypi.org/project/datafog/"><img src="https://img.shields.io/pypi/pyversions/datafog.svg?style=flat-square" alt="PyPI pyversions"></a>
|
13 | 12 | <a href="https://github.com/datafog/datafog-python"><img src="https://img.shields.io/github/stars/datafog/datafog-python.svg?style=flat-square&logo=github&label=Stars&logoColor=white" alt="GitHub stars"></a>
|
14 | 13 | <a href="https://pypistats.org/packages/datafog"><img src="https://img.shields.io/pypi/dm/datafog.svg?style=flat-square" alt="PyPi downloads"></a>
|
15 |
| -</p> |
16 |
| - |
17 |
| -<p align="center"> |
| 14 | + <a href="https://discord.gg/bzDth394R4"><img src="https://img.shields.io/discord/1173803135341449227?style=flat" alt="Discord"></a> |
18 | 15 | <a href="https://github.com/psf/black"><img src="https://img.shields.io/badge/code%20style-black-000000.svg?style=flat-square" alt="Code style: black"></a>
|
19 |
| -</p> |
20 |
| - |
21 |
| -<p align="center"> |
22 | 16 | <a href="https://codecov.io/gh/datafog/datafog-python"><img src="https://img.shields.io/codecov/c/github/datafog/datafog-python.svg?style=flat-square" alt="codecov"></a>
|
| 17 | + <a href="https://github.com/datafog/datafog-python/issues"><img src="https://img.shields.io/github/issues/datafog/datafog-python.svg?style=flat-square" alt="GitHub Issues"></a> |
23 | 18 | </p>
|
24 | 19 |
|
25 | 20 | ## Overview
|
26 | 21 |
|
27 |
| -DataFog works by scanning and redacting-out PII in files **before** they get uploaded to a Generative AI system. |
| 22 | +### What is DataFog? |
| 23 | + |
| 24 | +DataFog is an open-source DevSecOps platform that lets you scan and redact Personally Identifiable Information (PII) out of your Generative AI applications. |
| 25 | + |
| 26 | +### What problem are we solving? |
| 27 | + |
| 28 | +**Context** |
| 29 | + |
| 30 | +The primary use case today is Retrieval Augmented Generation (RAG) systems. As a refresher, RAG systems retrieve information from a custom knowledge base, constructed by you or your team, and use it in responses, either by citing the files directly or by letting it inform what the model generates. This knowledge base is assembled deliberately: files are uploaded into a workflow, segmented into logical blocks of information, and tagged according to their contextual significance. There are a thousand ways to add nuance to this characterization, but it suffices for the vast majority of cases! |
| 31 | + |
| 32 | +**Problem** |
| 33 | + |
| 34 | +How do you keep: |
| 35 | + |
| 36 | +- Customer PII |
| 37 | +- Employee PII |
| 38 | +- Sensitive company information pertaining to org changes or restructurings |
| 39 | +- Pending M&A activity |
| 40 | +- Conversations with external counsel on material corporate matters (e.g. a product recall) |
| 41 | +- and more |
| 42 | + |
| 43 | +from entering a Generative AI environment in the first place? What you need is a tool to scan and redact your RAG-bound documents based on your organization's or team's needs. |
28 | 44 |
|
29 |
| -## How it works |
| 45 | +That's where DataFog comes in. We address this problem with two main approaches: |
30 | 46 |
|
31 |
| -<img src="https://www.datafog.ai/hero.png" alt="DataFog Overview"> |
| 47 | +- **PII Observability**: Take in your batch/streaming data and return a scan indicating character-level detection of entities. |
| 48 | +- **Privacy Filter**: DataFog can slot in as a pre-processor that redacts PII from your files before they get uploaded to a RAG database. |
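
To make "character-level detection" concrete, here is a minimal sketch in plain Python. It is not DataFog's implementation: `find_terms` is a hypothetical helper, and the dict shape of each detection is an assumption chosen to mirror the `type`/`start`/`end` fields shown in the scan output under Examples.

```python
import re


def find_terms(text, terms, entity_type="CUSTOM_PII"):
    """Return character-level spans for every occurrence of the given terms.

    Hypothetical helper for illustration only; matching is exact and
    case-sensitive.
    """
    spans = []
    for term in terms:
        if not term:
            continue  # an empty term would match at every position
        for match in re.finditer(re.escape(term), text):
            spans.append(
                {"type": entity_type, "start": match.start(), "end": match.end()}
            )
    # Report detections in reading order.
    return sorted(spans, key=lambda span: span["start"])


chunk = "I'm announcing on Friday that Jeff is going to be CTO."
print(find_terms(chunk, ["CTO", "Jeff"]))
# [{'type': 'CUSTOM_PII', 'start': 30, 'end': 34}, {'type': 'CUSTOM_PII', 'start': 50, 'end': 53}]
```

Offsets are zero-based with an exclusive end, so `chunk[50:53]` is `"CTO"`.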
| 49 | + |
| 50 | +You can import this SDK into a Python environment, such as a Google Colab notebook (check out our [Getting Started](examples/getting-started.ipynb) notebook), and be up and running within a few lines of code. |
| 51 | + |
| 52 | +### How it works |
| 53 | + |
| 54 | +<img src="https://www.datafog.ai/hero.png" alt="DataFog Overview" style="width:50%;"> |
| 55 | + |
| 56 | +### There are lots of PII tools out there; why DataFog? |
| 57 | + |
| 58 | +If you look at the landscape of PII detection tools, their very existence was in many cases driven by regulatory requirements (e.g. 'comply with CCPA/GDPR/HIPAA'). |
| 59 | +In this scenario, there's a well-defined problem, a fixed set of entities to look for, and a relatively static universe of document schemas to work with. As a result, the products |
| 60 | +are purpose-built for the problem they solve. |
| 61 | + |
| 62 | +However, Generative AI changes how we think about privacy. Privacy requirements now shift over time (new M&A deals and internal discussions mean new terms to scan and redact), and there are different and varying document sources to contend with. PII detection is no longer just about compliance; it's an active (and for some, new) internal security threat that CISOs and engineering leaders must contend with. We want DataFog to be built and driven to meet the needs of the open-source community as they tackle this challenge. |
32 | 63 |
|
33 | 64 | ## Installation
|
34 | 65 |
|
35 | 66 | DataFog can be installed via pip:
|
36 | 67 |
|
37 | 68 | ```bash
|
38 |
| -pip install datafog # python client |
| 69 | +pip install datafog |
39 | 70 | ```
|
40 | 71 |
|
41 |
| -## Usage |
42 |
| - |
43 |
| -We're going to build up functionality starting with support for the Microsoft Presidio library. If you have any custom requests that would be of benefit to the community, please let us know! |
| 72 | +and in your Python environment: |
44 | 73 |
|
45 | 74 | ```
|
46 |
| - import requests |
47 |
| - from datafog import PresidioEngine as presidio |
48 |
| -
|
49 |
| - # Example: Detecting PII in a String |
50 |
| - pii_detected = presidio.scan("My name is John Doe and my email is [email protected]") |
51 |
| - print("PII Detected:", pii_detected) |
52 |
| -
|
53 |
| - # Example: Detecting PII in a File |
54 |
| - sample_filepath = "/Users/sidmohan/Desktop/v2.0.0/datafog-python/tests/files/input_files/sample.csv" |
55 |
| - with open(sample_filepath, "r") as f: |
56 |
| - original_value = f.read() |
57 |
| - pii_detected = presidio.scan(original_value) |
58 |
| - print("PII Detected in File:", pii_detected) |
59 |
| -
|
60 |
| - # Example: Detecting PII in a URL |
61 |
| - sample_url = "https://gist.githubusercontent.com/sidmohan0/1aa3ec38b4e6594d3c34b113f2e0962d/raw/42e57146197be0f85a5901cd1dcdd9ad15b31bab/sotu_2023.txt" |
62 |
| - response = requests.get(sample_url) |
63 |
| - original_value = response.text |
64 |
| - pii_detected = presidio.scan(original_value) |
65 |
| - print("PII Detected in URL Content:", pii_detected) |
66 |
| -
|
| 75 | +from datafog import PresidioEngine as presidio |
67 | 76 | ```
|
68 | 77 |
|
69 |
| -Depending on your input, the output will be a list of detected PII entities: |
| 78 | +## Examples |
| 79 | + |
| 80 | +Here are some examples of DataFog being used to redact information in business contexts. Please see '/examples' for our [Getting Started](examples/getting-started.ipynb) notebook. We'll be regularly updating content and providing comprehensive guides to using DataFog in production contexts. If you have any ideas for a tutorial or guide that you would like to see, please let us know! |
70 | 81 |
|
71 | 82 | ```
|
72 |
| -PII Detected: [type: EMAIL_ADDRESS, start: 36, end: 53, score: 1.0, type: PERSON, start: 11, end: 19, score: 0.85, type: URL, start: 44, end: 53, score: 0.5] |
| 83 | + ceo_email_chunk = "I'm announcing on Friday that Jeff is going to be CTO." |
| 84 | +
|
| 85 | + scan_results1 = presidio.scan(ceo_email_chunk) |
| 86 | + print("PII Detected - base case:", scan_results1) |
| 87 | + # PII Detected - base case: [type: PERSON, start: 30, end: 34, score: 0.85] |
| 88 | +
|
| 89 | +
|
| 90 | + scan_results2 = presidio.scan(ceo_email_chunk, deny_list=['CTO']) |
| 91 | + print("PII Detected with deny list:", scan_results2) |
| 92 | + # PII Detected with deny list: [type: CUSTOM_PII, start: 50, end: 53, score: 1.0, type: PERSON, start: 30, end: 34, score: 0.85] |
| 93 | +
|
73 | 94 | ```
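
The scan output above reports character offsets, which is enough to drive a redaction step. The following is a sketch only: `redact` is a hypothetical helper, not part of the DataFog API, and the list-of-dicts shape is an assumption; the offsets and types are copied from the deny-list output shown in the example.

```python
def redact(text, detections):
    """Replace each detected span with a <TYPE> placeholder.

    Hypothetical helper for illustration; assumes spans do not overlap.
    """
    redacted = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for d in sorted(detections, key=lambda d: d["start"], reverse=True):
        redacted = redacted[: d["start"]] + f"<{d['type']}>" + redacted[d["end"] :]
    return redacted


ceo_email_chunk = "I'm announcing on Friday that Jeff is going to be CTO."

# Assumed representation of the scan results printed in the example.
detections = [
    {"type": "CUSTOM_PII", "start": 50, "end": 53, "score": 1.0},
    {"type": "PERSON", "start": 30, "end": 34, "score": 0.85},
]

print(redact(ceo_email_chunk, detections))
# I'm announcing on Friday that <PERSON> is going to be <CUSTOM_PII>.
```

Replacing from the end of the string backwards avoids recomputing offsets, since each placeholder can change the length of the text to its right only.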
|
74 | 95 |
|
75 | 96 | ## Contributing
|
76 | 97 |
|
77 |
| -This is an open-source project and we welcome contributions. If you have any questions, please feel free to reach out to us, join our Discord or email me directly at [email protected]. |
| 98 | +DataFog is a community-driven **open-source** platform and we've been fortunate to have a small and growing contributor base. We'd love to hear ideas, feedback, and suggestions for improvement: anything on your mind about what could make DataFog better! Join our [Discord](https://discord.gg/bzDth394R4) to become part of the community. |
78 | 99 |
|
79 | 100 | ### Dev Notes
|
80 | 101 |
|
81 |
| -- Clone repo |
82 |
| -- Run 'poetry install' to install dependencies (recommend entering poetry shell for preserving dependencies) |
83 | 102 | - Justfile commands:
|
84 | 103 | - `just format` to apply formatting.
|
85 | 104 | - `just lint` to check formatting and style.
|
86 |
| - - `just tag` to tag your project on git |
87 |
| - - `just upload` to publish to PyPi. |
88 | 105 |
|
89 | 106 | ### Testing
|
90 | 107 |
|
|
96 | 113 |
|
97 | 114 | ```
|
98 | 115 |
|
99 |
| -### License |
| 116 | +## License |
100 | 117 |
|
101 | 118 | This software is published under the [MIT
|
102 | 119 | license](https://en.wikipedia.org/wiki/MIT_License).
|