feat: allowlist support, presidio entity aliases, py.typed (4.7.0) by sidmohan0 · Pull Request #159 · DataFog/datafog-python

sidmohan0 · 2026-07-02T22:21:08Z

What this is

The engine work for a fast 4.7.0 release (pulled forward from DFPY-110/115 scope), motivated by a day of dogfooding the firewall and by the litellm upstream PR:

Allowlist: scan(text, allowlist=[...]) for exact values, allowlist_patterns=[...] for full-match regexes. Threaded through both adapters: DATAFOG_HOOK_ALLOWLIST / DATAFOG_HOOK_ALLOWLIST_PATTERNS env vars for the Claude Code hook, constructor params for the LiteLLM guardrail. The motivating false-positive catalog from today: unix timestamps and 10-digit API IDs matching PHONE, own email in tool metadata, doc placeholders, test fixtures.
Presidio entity aliases: EMAIL_ADDRESS, US_SSN accepted as input aliases (via the existing CANONICAL_TYPE_MAP, which already had PHONE_NUMBER) — the migration bridge for presidio configs.
py.typed: ships the marker + package_data, so downstream type checkers (including litellm's basedpyright gate, which currently needs a suppression for our import) see our annotations.
Backports the upstream Greptile review fixes to the in-repo litellm adapter (guardrail spans recorded on the returned dict; redaction reported as guardrail_intervened).
Docs correction: fixes an entity-name error I introduced in #156 (docs: refresh README for 4.6): the scan API returns DATE/ZIP_CODE; DOB/ZIP are input aliases, not output types.
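A minimal sketch of the input-alias idea, for readers migrating presidio configs. This is not the library's code: `EXAMPLE_TYPE_MAP` and `canonicalize` are hypothetical stand-ins for the existing CANONICAL_TYPE_MAP, and the canonical names `EMAIL` and `SSN` are assumptions (only PHONE, DATE and ZIP_CODE are named as output types above).

```python
from typing import Dict, Iterable, List

# Hypothetical stand-in for CANONICAL_TYPE_MAP: alias -> canonical output type.
# "EMAIL" and "SSN" as canonical names are assumptions made for illustration.
EXAMPLE_TYPE_MAP: Dict[str, str] = {
    "PHONE_NUMBER": "PHONE",
    "EMAIL_ADDRESS": "EMAIL",
    "US_SSN": "SSN",
    "DOB": "DATE",
    "ZIP": "ZIP_CODE",
}


def canonicalize(names: Iterable[str]) -> List[str]:
    """Map input aliases to canonical types; unknown names pass through unchanged."""
    result = []
    for name in names:
        key = name.strip().upper()
        result.append(EXAMPLE_TYPE_MAP.get(key, key))
    return result


requested = canonicalize(["EMAIL_ADDRESS", "US_SSN", "DOB", "ZIP_CODE"])
```

The point of the bridge is that aliases are accepted on input only; results are always reported under the canonical names.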

Design decisions

Exact allowlist matches full entity text only (no substring); patterns must fullmatch — a partial match never suppresses a finding.
Invalid patterns raise ValueError at the API boundary (fail fast); the hook converts that to its usual fail-open.
redact(entities=[...], allowlist=...) is rejected explicitly — filtering pre-scanned entities is the caller's job.
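The first two decisions can be sketched in a few lines. This is an illustration of the stated semantics under assumed names (`compile_patterns` and `is_allowlisted` are hypothetical helpers, not functions from the package): exact entries compare against the whole entity text, patterns use `re.fullmatch`, and a bad pattern surfaces as ValueError.

```python
import re
from typing import Iterable, List, Pattern, Set


def compile_patterns(patterns: Iterable[str]) -> List[Pattern[str]]:
    """Compile allowlist patterns, failing fast with ValueError on a bad regex."""
    compiled = []
    for raw in patterns:
        try:
            compiled.append(re.compile(raw))
        except re.error as exc:
            raise ValueError(f"invalid allowlist pattern {raw!r}: {exc}") from exc
    return compiled


def is_allowlisted(entity_text: str, exact: Set[str], patterns: List[Pattern[str]]) -> bool:
    """Return True only if the full entity text is exempt; partial matches never count."""
    if entity_text in exact:  # whole-string equality, no substring check
        return True
    return any(p.fullmatch(entity_text) is not None for p in patterns)


exact = {"support@example.com"}
patterns = compile_patterns([r"1[67]\d{8}"])  # e.g. 10-digit unix timestamps

timestamp_exempt = is_allowlisted("1700000000", exact, patterns)
longer_number_exempt = is_allowlisted("1700000000123", exact, patterns)
email_exempt = is_allowlisted("support@example.com", exact, patterns)
```

Using `fullmatch` rather than `search` is what keeps a pattern for 10-digit timestamps from also suppressing a longer number that merely contains one.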

Why now (release strategy)

litellm's supply-chain quarantine (exclude-newer = 3 days) means any datafog release takes 3 days to become usable in their CI. Shipping 4.7.0 immediately starts that clock: quarantine-clear by ~July 6, in time for the July 5 CI-pin push to go straight to 4.7.0, and it removes the type-stub suppression from the upstream PR.

Test plan

11 new allowlist/alias/py.typed tests (tests/test_allowlist.py), TDD (verified RED first)
5 new hook allowlist tests, 1 new guardrail allowlist test
Full suite: 622 passed (3 pre-existing spaCy-extra import failures, environmental)
pre-commit clean (local prettier hook flake verified against standalone prettier)
CI green

Adds allowlist (exact values) and allowlist_patterns (full-match regexes) to scan/redact and threads them through both agent adapters: DATAFOG_HOOK_ALLOWLIST / DATAFOG_HOOK_ALLOWLIST_PATTERNS env vars for the Claude Code hook, allowlist/allowlist_patterns params for the LiteLLM guardrail. Motivated by a day of dogfooding: unix timestamps and numeric IDs match the PHONE pattern, and intentional identifiers (own support email, doc placeholders) should be exemptable. Accepts presidio-style entity names (EMAIL_ADDRESS, US_SSN) as input aliases via the existing canonical type map, ships a py.typed marker so downstream type checkers see our annotations, and backports the upstream-review fixes to the in-repo litellm adapter (guardrail spans recorded on the returned dict, redaction reported as intervention). Also corrects an entity-name documentation error introduced in #156: the scan API returns DATE and ZIP_CODE (DOB/ZIP are input aliases).

Review findings: reject quantified groups containing nested quantifiers at compile time (catastrophic backtracking on attacker-influenced entity text), cap pattern length at 512 chars, and skip pattern matching for entities longer than 512 chars (fail-safe: the finding is kept). Match semantics documented as case-sensitive with no Unicode normalization; allowlist entries are operator configuration, never end-user input. Adds regression tests for the rejection heuristic, the smart-engine path, and the redact(entities=..., allowlist=...) guard. Replaces a walrus assignment with a plain one in the litellm adapter.
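A rough sketch of what guards like these could look like. The 512-character limits come from the text above; the textual heuristic is an assumption of mine and not the PR's actual check, and `validate_pattern` / `pattern_suppresses` are hypothetical names. The heuristic is deliberately crude: it misses nesting through extra parentheses and may reject some safe patterns (for example an escaped `\+` inside a quantified group).

```python
import re
from typing import List, Pattern

MAX_PATTERN_LEN = 512   # cap on pattern source length (from the review findings)
MAX_SUBJECT_LEN = 512   # entities longer than this skip pattern matching

# Assumed heuristic: a group containing + or * that is itself followed by
# +, * or {...}, e.g. (a+)+ or (?:\d*)*. Not the PR's real implementation.
_NESTED_QUANTIFIER = re.compile(r"\([^()]*[+*][^()]*\)[+*{]")


def validate_pattern(raw: str) -> Pattern[str]:
    """Compile an allowlist pattern, raising ValueError for risky or invalid input."""
    if len(raw) > MAX_PATTERN_LEN:
        raise ValueError(f"allowlist pattern exceeds {MAX_PATTERN_LEN} chars")
    if _NESTED_QUANTIFIER.search(raw):
        raise ValueError(f"allowlist pattern {raw!r} has a nested quantifier")
    try:
        return re.compile(raw)
    except re.error as exc:
        raise ValueError(f"invalid allowlist pattern {raw!r}: {exc}") from exc


def pattern_suppresses(entity_text: str, patterns: List[Pattern[str]]) -> bool:
    """Fail-safe: over-long entity text is never suppressed, so the finding is kept."""
    if len(entity_text) > MAX_SUBJECT_LEN:
        return False
    return any(p.fullmatch(entity_text) is not None for p in patterns)


digits = [validate_pattern(r"\d+")]
short_suppressed = pattern_suppresses("1234567890", digits)
long_suppressed = pattern_suppresses("1" * 600, digits)
```

The fail-safe direction matters: when a guard trips, the sketch keeps the finding rather than suppressing it, so a hostile input can at worst produce a false positive.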

sidmohan0 added 3 commits July 2, 2026 15:21

test: cover subject-length cap fail-safe in allowlist pattern matching

4894c8d

sidmohan0 merged commit 6bb4c89 into dev Jul 2, 2026
26 checks passed

This was referenced Jul 2, 2026

Release v4.7.0 #160

Merged

ci: make codecov wait for all coverage uploads before computing status #161

Merged

sidmohan0 merged 3 commits into dev from feat/4.7-allowlist