
feat: allowlist support, presidio entity aliases, py.typed (4.7.0)#159

Merged
sidmohan0 merged 3 commits into dev from feat/4.7-allowlist
Jul 2, 2026

@sidmohan0

Contributor

What this is

The engine work for a fast 4.7.0 release (pulled forward from DFPY-110/115 scope), motivated by a day of dogfooding the firewall and by the litellm upstream PR:

  • Allowlist: scan(text, allowlist=[...]) for exact values, allowlist_patterns=[...] for full-match regexes. Threaded through both adapters: DATAFOG_HOOK_ALLOWLIST / DATAFOG_HOOK_ALLOWLIST_PATTERNS env vars for the Claude Code hook, constructor params for the LiteLLM guardrail. The motivating false-positive catalog from today: unix timestamps and 10-digit API IDs matching PHONE, own email in tool metadata, doc placeholders, test fixtures.
  • Presidio entity aliases: EMAIL_ADDRESS, US_SSN accepted as input aliases (via the existing CANONICAL_TYPE_MAP, which already had PHONE_NUMBER) — the migration bridge for presidio configs.
  • py.typed: ships the marker + package_data, so downstream type checkers (including litellm's basedpyright gate, which currently needs a suppression for our import) see our annotations.
  • Backports the upstream Greptile review fixes to the in-repo litellm adapter (guardrail spans recorded on the returned dict; redaction reported as guardrail_intervened).
  • Docs correction: fixes an entity-name error I introduced in docs: refresh README for 4.6 #156 — the scan API returns DATE/ZIP_CODE; DOB/ZIP are input aliases, not output types.
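The alias handling described above amounts to a lookup from input names to canonical output names. The following is a minimal sketch of that idea, not the library's code: `ALIASES` and `canonicalize` are hypothetical names standing in for the real `CANONICAL_TYPE_MAP`, and the canonical names `EMAIL` and `SSN` are assumptions (only `PHONE`, `DATE`, and `ZIP_CODE` are named in this PR).

```python
# Hypothetical stand-in for the alias table. The real CANONICAL_TYPE_MAP
# lives in datafog and may differ in shape and contents.
ALIASES = {
    "EMAIL_ADDRESS": "EMAIL",  # presidio-style alias (canonical name assumed)
    "US_SSN": "SSN",           # presidio-style alias (canonical name assumed)
    "PHONE_NUMBER": "PHONE",
    "DOB": "DATE",             # input alias; the scan API reports DATE
    "ZIP": "ZIP_CODE",         # input alias; the scan API reports ZIP_CODE
}


def canonicalize(entity_types):
    """Map input aliases to canonical types, keeping order and dropping duplicates."""
    result = []
    for name in entity_types:
        canonical = ALIASES.get(name, name)  # unknown names pass through unchanged
        if canonical not in result:
            result.append(canonical)
    return result


print(canonicalize(["EMAIL_ADDRESS", "EMAIL", "US_SSN", "DOB", "ZIP"]))
# ['EMAIL', 'SSN', 'DATE', 'ZIP_CODE']
```

Under this sketch a presidio config that lists `EMAIL_ADDRESS` and a native config that lists `EMAIL` resolve to the same request, while output types stay canonical.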

Design decisions

  • Exact allowlist matches full entity text only (no substring); patterns must fullmatch — a partial match never suppresses a finding.
  • Invalid patterns raise ValueError at the API boundary (fail fast); the hook converts that to its usual fail-open.
  • redact(entities=[...], allowlist=...) is rejected explicitly — filtering pre-scanned entities is the caller's job.
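To make the matching rules above concrete, here is a minimal, self-contained sketch of the semantics. It does not call datafog: `apply_allowlist`, `compile_patterns`, and the `(entity_type, text)` tuple shape are hypothetical, chosen only to illustrate full-text exact matching, `fullmatch` for patterns, and `ValueError` on an invalid pattern.

```python
import re


def compile_patterns(patterns):
    """Compile allowlist patterns, failing fast with ValueError on a bad regex."""
    compiled = []
    for pattern in patterns:
        try:
            compiled.append(re.compile(pattern))
        except re.error as exc:
            raise ValueError(f"invalid allowlist pattern {pattern!r}: {exc}") from exc
    return compiled


def apply_allowlist(findings, allowlist=(), allowlist_patterns=()):
    """Drop findings whose full text is allowlisted; keep everything else.

    `findings` is a list of (entity_type, text) tuples (hypothetical shape).
    """
    exact = set(allowlist)
    compiled = compile_patterns(allowlist_patterns)
    kept = []
    for entity_type, text in findings:
        if text in exact:  # whole entity text only, never a substring
            continue
        if any(rx.fullmatch(text) for rx in compiled):  # partial matches do not count
            continue
        kept.append((entity_type, text))
    return kept


findings = [
    ("PHONE", "1719936000"),           # unix timestamp misread as a phone number
    ("EMAIL", "support@example.com"),  # intentional identifier
    ("PHONE", "555-867-5309"),         # genuine finding
]
print(apply_allowlist(
    findings,
    allowlist=["support@example.com"],
    allowlist_patterns=[r"17\d{8}"],
))
# [('PHONE', '555-867-5309')]
```

Note that `allowlist=["555"]` or `allowlist_patterns=[r"\d{3}"]` would leave the `555-867-5309` finding in place, because neither covers the whole entity text.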

Why now (release strategy)

litellm's supply-chain quarantine (exclude-newer = 3 days) means any datafog release takes 3 days to become usable in their CI. Shipping 4.7.0 immediately starts that clock: quarantine-clear by ~July 6, in time for the July 5 CI-pin push to go straight to 4.7.0, and it removes the type-stub suppression from the upstream PR.

Test plan

  • 11 new allowlist/alias/py.typed tests (tests/test_allowlist.py), TDD (verified RED first)
  • 5 new hook allowlist tests, 1 new guardrail allowlist test
  • Full suite: 622 passed (3 pre-existing spaCy-extra import failures, environmental)
  • pre-commit clean (local prettier hook flake verified against standalone prettier)
  • CI green

sidmohan0 added 3 commits July 2, 2026 15:21
Adds allowlist (exact values) and allowlist_patterns (full-match
regexes) to scan/redact and threads them through both agent adapters:
DATAFOG_HOOK_ALLOWLIST / DATAFOG_HOOK_ALLOWLIST_PATTERNS env vars for
the Claude Code hook, allowlist/allowlist_patterns params for the
LiteLLM guardrail. Motivated by a day of dogfooding: unix timestamps
and numeric IDs match the PHONE pattern, and intentional identifiers
(own support email, doc placeholders) should be exemptable.

Accepts presidio-style entity names (EMAIL_ADDRESS, US_SSN) as input
aliases via the existing canonical type map, ships a py.typed marker
so downstream type checkers see our annotations, and backports the
upstream-review fixes to the in-repo litellm adapter (guardrail spans
recorded on the returned dict, redaction reported as intervention).

Also corrects an entity-name documentation error introduced in #156:
the scan API returns DATE and ZIP_CODE (DOB/ZIP are input aliases).

Review findings: reject quantified groups containing nested quantifiers
at compile time (catastrophic backtracking on attacker-influenced entity
text), cap pattern length at 512 chars, and skip pattern matching for
entities longer than 512 chars (fail-safe: the finding is kept). Match
semantics documented as case-sensitive with no Unicode normalization;
allowlist entries are operator configuration, never end-user input.
Adds regression tests for the rejection heuristic, the smart-engine
path, and the redact(entities=..., allowlist=...) guard. Replaces a
walrus assignment with a plain one in the litellm adapter.
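
The hardening described in the review-findings commit can be sketched as follows. This is one possible heuristic, not the code merged here: `has_nested_quantifier`, `validate_pattern`, `is_suppressed`, and `MAX_LEN` are hypothetical names, and treating `*`, `+`, and `{` as the quantifiers of interest is an assumption.

```python
import re

MAX_LEN = 512          # cap for both pattern length and entity length
QUANTIFIER_START = "*+{"


def has_nested_quantifier(pattern):
    """Return True if a quantified group itself contains a quantifier, e.g. (a+)+."""
    stack = []         # one flag per open group: "a quantifier was seen inside"
    in_class = False   # inside [...] the characters ( ) * + { are literals
    i, n = 0, len(pattern)
    while i < n:
        ch = pattern[i]
        if ch == "\\":             # skip the escaped character
            i += 2
            continue
        if in_class:
            if ch == "]":
                in_class = False
        elif ch == "[":
            in_class = True
        elif ch == "(":
            stack.append(False)
        elif ch == ")":
            inner = stack.pop() if stack else False
            followed = i + 1 < n and pattern[i + 1] in QUANTIFIER_START
            if inner and followed:
                return True
            if stack and (inner or followed):
                stack[-1] = True   # propagate to the enclosing group
        elif ch in QUANTIFIER_START and stack:
            stack[-1] = True
        i += 1
    return False


def validate_pattern(pattern):
    """Compile a pattern, raising ValueError for oversized or risky patterns."""
    if len(pattern) > MAX_LEN:
        raise ValueError(f"allowlist pattern longer than {MAX_LEN} chars")
    if has_nested_quantifier(pattern):
        raise ValueError(f"allowlist pattern has nested quantifiers: {pattern!r}")
    try:
        return re.compile(pattern)
    except re.error as exc:
        raise ValueError(f"invalid allowlist pattern {pattern!r}: {exc}") from exc


def is_suppressed(entity_text, compiled):
    """Fail safe: overlong entities skip pattern matching, so the finding is kept."""
    if len(entity_text) > MAX_LEN:
        return False
    return any(rx.fullmatch(entity_text) for rx in compiled)
```

The heuristic is deliberately conservative: it can reject some harmless patterns, which is an acceptable trade for operator-supplied configuration that is matched against attacker-influenced text.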
@sidmohan0 sidmohan0 merged commit 6bb4c89 into dev Jul 2, 2026
26 checks passed