Fast processing • Production-ready • Simple configuration
DataFog provides efficient PII detection using a pattern-first approach that processes text significantly faster than traditional NLP methods while maintaining high accuracy.
```python
# Basic usage example
from datafog import DataFog

results = DataFog().scan_text("John's email is [email protected] and SSN is 123-45-6789")
```
| Engine | Time (13.3KB document) | Relative Speed | Accuracy |
|---|---|---|---|
| DataFog (Regex) | ~2.4ms | 190x faster | High (structured) |
| DataFog (GLiNER) | ~15ms | 32x faster | Very High |
| DataFog (Smart) | ~3-15ms | 60x faster | Highest |
| spaCy | ~459ms | baseline | Good |

Performance measured on a 13.3KB business document. GLiNER provides excellent accuracy for named entities while keeping a large speed advantage.
| Type | Examples | Use Cases |
|---|---|---|
| Email | [email protected] | Contact scrubbing |
| Phone | (555) 123-4567 | Call log anonymization |
| SSN | 123-45-6789 | HR data protection |
| Credit Cards | 4111-1111-1111-1111 | Payment processing |
| IP Addresses | 192.168.1.1 | Network log cleaning |
| Dates | 01/01/1990 | Birthdate removal |
| ZIP Codes | 12345-6789 | Location anonymization |
```bash
# Lightweight core (fast regex-based PII detection)
pip install datafog

# With advanced ML models for better accuracy
pip install datafog[nlp]           # spaCy for advanced NLP
pip install datafog[nlp-advanced]  # GLiNER for modern NER
pip install datafog[ocr]           # Image processing with OCR
pip install datafog[all]           # Everything included
```
Detect PII in text:
```python
from datafog import DataFog
from datafog.services import TextService

# Simple detection (uses the fast regex engine)
detector = DataFog()
text = "Contact John Doe at [email protected] or (555) 123-4567"
results = detector.scan_text(text)
print(results)
# Finds: emails, phone numbers, and more

# Modern NER with GLiNER (requires: pip install datafog[nlp-advanced])
gliner_service = TextService(engine="gliner")
result = gliner_service.annotate_text_sync("Dr. John Smith works at General Hospital")
# Detects: PERSON, ORGANIZATION with high accuracy

# Best of both worlds: smart cascading (recommended for production)
smart_service = TextService(engine="smart")
result = smart_service.annotate_text_sync("Contact [email protected] or call (555) 123-4567")
# Uses regex for structured PII (fast), GLiNER for entities (accurate)
```
Anonymize on the fly:
```python
from datafog import DataFog

# Redact sensitive data
redacted = DataFog(operations=["scan", "redact"]).process_text(
    "My SSN is 123-45-6789 and email is [email protected]"
)
print(redacted)
# Output: "My SSN is [REDACTED] and email is [REDACTED]"

# Replace with fake data
replaced = DataFog(operations=["scan", "replace"]).process_text(
    "Call me at (555) 123-4567"
)
print(replaced)
# Output: "Call me at [PHONE_A1B2C3]"
```
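Surrogates like `[PHONE_A1B2C3]` pair an entity type with a short identifier. One common way to derive such tokens deterministically (a sketch of the general technique, not necessarily DataFog's exact scheme) is to truncate a hash of the original value, so the same input always maps to the same placeholder:

```python
import hashlib

def surrogate(entity_type, value):
    """Derive a stable placeholder like [PHONE_A1B2C3] from the raw value."""
    digest = hashlib.sha256(value.encode()).hexdigest()[:6].upper()
    return f"[{entity_type}_{digest}]"

token = surrogate("PHONE", "(555) 123-4567")
print(token)  # same input always yields the same token
```

Deterministic surrogates keep records joinable after anonymization: every occurrence of the same phone number becomes the same placeholder.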
Process images with OCR:
```python
import asyncio
from datafog import DataFog

async def scan_document():
    ocr_scanner = DataFog(operations=["extract", "scan"])
    results = await ocr_scanner.run_ocr_pipeline([
        "https://example.com/document.png"
    ])
    return results

# Extract text and find PII in images
results = asyncio.run(scan_document())
```
Choose the appropriate engine for your needs:
```python
from datafog.services import TextService

# Regex: fast, pattern-based (recommended for speed)
regex_service = TextService(engine="regex")

# spaCy: traditional NLP with broad entity recognition
spacy_service = TextService(engine="spacy")

# GLiNER: modern ML model optimized for NER (requires the nlp-advanced extra)
gliner_service = TextService(engine="gliner")

# Smart: cascading approach - regex → GLiNER → spaCy (best accuracy/speed balance)
smart_service = TextService(engine="smart")

# Auto: regex → spaCy fallback (legacy)
auto_service = TextService(engine="auto")
```
Performance & Accuracy Guide:
| Engine | Speed | Accuracy | Use Case | Install Requirements |
|---|---|---|---|---|
| regex | 🚀 Fastest | Good | Structured PII (emails, phones) | Core only |
| gliner | ⚡ Fast | Better | Modern NER, custom entities | `pip install datafog[nlp-advanced]` |
| spacy | 🐌 Slower | Good | Traditional NLP entities | `pip install datafog[nlp]` |
| smart | ⚡ Balanced | Best | Combines all approaches | `pip install datafog[nlp-advanced]` |
Model Management:
```bash
# Download specific GLiNER models

# PII-specialized model (recommended)
datafog download-model urchade/gliner_multi_pii-v1 --engine gliner

# General-purpose model
datafog download-model urchade/gliner_base --engine gliner

# List available models
datafog list-models --engine gliner
```
```python
from datafog import DataFog
from datafog.models.anonymizer import AnonymizerType, HashType

# Hash with different algorithms
hasher = DataFog(
    operations=["scan", "hash"],
    hash_type=HashType.SHA256  # or MD5, SHA3_256
)

# Target specific entity types only
selective = DataFog(
    operations=["scan", "redact"],
    entities=["EMAIL", "PHONE"]  # Only process these types
)
```
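The point of the hash operation is one-way consistency: equal inputs map to equal digests, so anonymized records can still be joined without exposing the original value. In plain `hashlib` terms (illustrative, not DataFog's exact output format):

```python
import hashlib

def hash_pii(value, algorithm="sha256"):
    """One-way hash of a PII value; equal inputs give equal digests."""
    return hashlib.new(algorithm, value.encode()).hexdigest()

a = hash_pii("[email protected]")
b = hash_pii("[email protected]")
print(a == b)  # True: joinable across records, but not reversible
```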
```python
from datafog import DataFog

documents = [
    "Document 1 with PII...",
    "Document 2 with more data...",
    "Document 3..."
]

# Process multiple documents efficiently
results = DataFog().batch_process(documents)
```
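When the document list is very large, feeding the scanner fixed-size chunks keeps memory bounded. A generic chunking helper (a pattern you can wrap around any batch API; this is not part of DataFog itself):

```python
from itertools import islice

def chunked(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch

docs = (f"Document {i} with PII..." for i in range(7))  # could be a file stream
for batch in chunked(docs, 3):
    # each batch would be handed to the scanner in turn
    print(len(batch))  # 3, 3, 1
```

Because `chunked` consumes a plain iterator, the full document set never needs to be materialized in memory at once.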
Performance comparison with alternatives:

```
DataFog (Regex):  ~2.4ms ████████████████████████████████ 190x faster
spaCy:           ~459ms ██ baseline
```
| Scenario | Recommended Engine | Why |
|---|---|---|
| High-volume processing | regex | Maximum speed, consistent performance |
| Unknown entity types | spacy | Broader entity recognition |
| General purpose | smart | Cascading fallback, best of both worlds |
| Real-time applications | regex | Sub-millisecond processing |
DataFog includes a command-line interface:
```bash
# Scan text for PII
datafog scan-text "John's email is [email protected]"

# Process images
datafog scan-image document.png --operations extract,scan

# Anonymize data
datafog redact-text "My phone is (555) 123-4567"
datafog replace-text "SSN: 123-45-6789"
datafog hash-text "Email: [email protected]" --hash-type sha256

# Utility commands
datafog health
datafog list-entities
datafog show-config
```
- Detection of regulated data types for GDPR/CCPA compliance
- Audit trails for tracking detection and anonymization
- Configurable detection thresholds
- Batch processing for handling multiple documents
- Memory-efficient processing for large files
- Async support for non-blocking operations
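Configurable thresholds typically mean dropping findings below a confidence score before they reach redaction, trading recall for precision. A minimal sketch of that filtering step (the field names here are illustrative, not DataFog's result schema):

```python
def filter_findings(findings, threshold=0.8):
    """Keep only findings whose confidence meets the threshold."""
    return [f for f in findings if f["score"] >= threshold]

findings = [
    {"entity": "EMAIL", "text": "[email protected]", "score": 0.99},
    {"entity": "PERSON", "text": "Ave Maria", "score": 0.42},  # likely a false positive
]
print(filter_findings(findings))  # only the high-confidence EMAIL survives
```

Raising the threshold reduces false positives at the cost of possibly missing borderline entities; compliance workloads usually tune it per entity type.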
```python
# FastAPI middleware example
from fastapi import FastAPI, Request
from datafog import DataFog

app = FastAPI()
detector = DataFog()

@app.middleware("http")
async def redact_pii_middleware(request: Request, call_next):
    # Scan/redact request data here before passing it downstream
    response = await call_next(request)
    return response
```
- Log sanitization
- Data migration with PII handling
- Compliance reporting and audits
- Dataset preparation and anonymization
- Privacy-preserving analytics
- Research compliance
- Test data generation
- Code review for PII detection
- API security validation
```bash
pip install datafog
```

```bash
git clone https://github.com/datafog/datafog-python
cd datafog-python
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements-dev.txt
just setup
```
```dockerfile
FROM python:3.10-slim
RUN pip install datafog
COPY . .
CMD ["python", "your_script.py"]
```
Contributions are welcome in the form of:
- Bug reports
- Feature requests
- Documentation improvements
- New regex patterns for PII detection
- Performance improvements
```bash
# Setup development environment
git clone https://github.com/datafog/datafog-python
cd datafog-python
just setup

# Run tests
just test

# Format code
just format

# Submit PR
git checkout -b feature/your-improvement
# Make your changes
git commit -m "Add your improvement"
git push origin feature/your-improvement
```
See CONTRIBUTING.md for detailed guidelines.
```bash
# Install benchmark dependencies
pip install pytest-benchmark

# Run performance tests
pytest tests/benchmark_text_service.py -v

# Compare with baseline
bash scripts/run_benchmark_locally.sh
```
Our CI pipeline:
- Runs benchmarks on every PR
- Compares against baseline performance
- Fails builds if performance degrades >10%
- Tracks performance trends over time
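The >10% regression gate boils down to a ratio check against the stored baseline. A sketch of that comparison (the repository's CI scripts implement the real version):

```python
def regression(baseline_ms, current_ms, tolerance=0.10):
    """Return True when the current run is more than `tolerance` slower."""
    return current_ms > baseline_ms * (1 + tolerance)

assert not regression(100.0, 105.0)  # 5% slower: within tolerance, build passes
assert regression(100.0, 115.0)      # 15% slower: build fails
```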
| Resource | Link |
|---|---|
| Documentation | docs.datafog.ai |
| Community Discord | Join here |
| Bug Reports | GitHub Issues |
| Feature Requests | GitHub Discussions |
| Support | [email protected] |
DataFog is released under the MIT License.
Built with:
- Pattern optimization for efficient processing
- spaCy integration for NLP capabilities
- Tesseract & Donut for OCR capabilities
- Pydantic for data validation