fix: apply allow/exclude_repositories filters to transcript session data #1623

Open
Siddhant-K-code wants to merge 7 commits into main from fix/transcript-repo-filter
Conversation

@Siddhant-K-code Siddhant-K-code commented Jun 23, 2026

Collaborator

Closes #1605

Problem

A customer expected their allow_repositories / exclude_repositories filters to also apply to session (transcript) data, but those filters only gated checkpoints (Config::is_allowed_repository in src/commands/git_ai_handlers.rs). Session/transcript MetricEvents (SessionEvent, OtelTrace) were emitted and uploaded regardless of the configured repository lists.

The transcript pipeline already resolves the session's repository (hook repo_work_dir > DB > inferred cwd) in order to stamp repo_url on each event; it just never consulted the filters. This is effectively a bug fix: the plumbing already existed.
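
The resolution order above (hook repo_work_dir > DB > inferred cwd) can be pictured with a small sketch. This is illustrative only: `WorkDirSources` and `resolve_work_dir` are hypothetical names, not the actual types or functions in src/daemon/stream_worker.rs.

```rust
use std::path::PathBuf;

/// Hypothetical container for the three candidate sources of a session's
/// work_dir. The field names are assumptions made for this sketch.
#[derive(Debug, Default, Clone)]
pub struct WorkDirSources {
    pub hook_repo_work_dir: Option<PathBuf>,
    pub db_work_dir: Option<PathBuf>,
    pub inferred_cwd: Option<PathBuf>,
}

/// Pick the first available source in priority order:
/// hook repo_work_dir, then the DB record, then the inferred cwd.
/// Returns None when no source is known (the "unknown repo" case).
pub fn resolve_work_dir(sources: &WorkDirSources) -> Option<PathBuf> {
    sources
        .hook_repo_work_dir
        .clone()
        .or_else(|| sources.db_work_dir.clone())
        .or_else(|| sources.inferred_cwd.clone())
}
```

A `None` result here corresponds to the case where the repository cannot be verified against the filters.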

Fix

In src/daemon/stream_worker.rs:

  • New session_repo_allowed(config, work_dir) helper.
  • Early bail-out in process_session_blocking, placed right after the work_dir is resolved and before any events are emitted or the watermark is advanced. On a skip we emit nothing and intentionally do not advance the watermark, so a backlog re-flows if the filters later change.
  • Reads Config::fresh() so a long-lived daemon observes config changes without a restart.
  • Reuses Config::is_allowed_repository_with_remotes, so semantics match the checkpoint path exactly: exclusions take precedence, an empty allowlist allows everything, and an active allowlist fails closed when the repo can't be verified (no work_dir, not a git repo, or no remote).

A single filter covers both entry points (checkpoint-triggered notifications and the periodic sweep) since both flow through process_session_blocking.
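
For reviewers who want the filter semantics in one place, here is a minimal sketch of the decision described above. It is not the real `Config::is_allowed_repository_with_remotes`: `repo_allowed` and `wildcard_match` are hypothetical, the matcher handles only `*`, and an empty `remotes` slice stands in for "the repo could not be verified" (no work_dir, not a git repo, or no remote).

```rust
/// Minimal `*`-only wildcard match. A hypothetical stand-in for the
/// project's real glob matching, used only to keep this sketch self-contained.
fn wildcard_match(pattern: &str, text: &str) -> bool {
    let (p, t) = (pattern.as_bytes(), text.as_bytes());
    let (mut pi, mut ti) = (0usize, 0usize);
    // (pattern index just after the last '*', text index that '*' has consumed up to)
    let mut star: Option<(usize, usize)> = None;
    while ti < t.len() {
        if pi < p.len() && p[pi] == b'*' {
            star = Some((pi + 1, ti));
            pi += 1;
        } else if pi < p.len() && p[pi] == t[ti] {
            pi += 1;
            ti += 1;
        } else if let Some((sp, st)) = star {
            // Backtrack: let the last '*' absorb one more byte of text.
            pi = sp;
            ti = st + 1;
            star = Some((sp, st + 1));
        } else {
            return false;
        }
    }
    // Any trailing '*' in the pattern matches the empty remainder.
    while pi < p.len() && p[pi] == b'*' {
        pi += 1;
    }
    pi == p.len()
}

/// Sketch of the allow/exclude decision.
/// `remotes` is empty when the repository could not be verified.
pub fn repo_allowed(allow: &[&str], exclude: &[&str], remotes: &[&str]) -> bool {
    // No filters configured: fast path, everything is allowed.
    if allow.is_empty() && exclude.is_empty() {
        return true;
    }
    // Unknown repo: fail closed only when an allowlist is active.
    if remotes.is_empty() {
        return allow.is_empty();
    }
    let matches_any = |patterns: &[&str]| {
        remotes
            .iter()
            .any(|url| patterns.iter().any(|p| wildcard_match(p, url)))
    };
    // Exclusions take precedence over the allowlist.
    if matches_any(exclude) {
        return false;
    }
    // An empty allowlist allows everything that was not excluded.
    allow.is_empty() || matches_any(allow)
}
```

The unknown-repo branch is the one that matters for shared streams: it returns `false` under any active allowlist and `true` under exclude-only filters.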

Decisions

  • Fail-closed under an allowlist. Customers set these filters for security, so a session whose repository can't be verified is dropped when an allowlist is active.
  • Shared streams (Copilot OTEL) never carry a repo URL: resolved_work_dir is forced to None for them, and OTEL spans contain no path or cwd. They therefore fall into the unknown-repo path: dropped under an active allowlist, passed under exclude-only filters. In practice, setting allow_repositories disables Copilot OTEL telemetry, which is the security-correct behavior.

Tests

  • 6 unit tests in src/daemon/stream_worker_tests.rs exercising the real session_repo_allowed against real git repos, real Config, and real glob matching: no-filters fast path, allowlist match/miss, exclude precedence, fail-closed unknown-repo under allowlist, and exclude-only unknown-repo pass-through.
  • Added allow_repositories / exclude_repositories to ConfigPatch + apply_test_config_patch, wired into the TestRepo home-config writer (enables daemon-level config patching in future integration tests).
  • Added Config::with_repository_filters_for_test.

task build, task lint, task fmt all clean; new tests pass; config/streams suites show no regressions.

How to verify

Automated (unit tests for the filter decision)

# The 6 new tests covering the filter decision (real git repos + real Config + globs)
task test TEST_FILTER=session_repo_allowed CARGO_TEST_ARGS="--lib"

# Existing session/transcript suites still green
task test TEST_FILTER=session_event
task test TEST_FILTER=streams_e2e

# Lint + format (must be clean for CI)
task lint
task fmt

Manual end-to-end (real daemon)

# 1. Install the local debug build system-wide (routes git through git-ai + restarts daemon)
task dev

# 2. In a repo whose remote you want to EXCLUDE:
cd /path/to/secret-repo
git remote -v   # e.g. git@github.com:acme/secret.git

# 3. Add it to the exclude list (accepts a pattern, a path, or '.' for the current repo's remotes)
git-ai config set exclude_repositories "*github.com/acme/secret*"
# (or, allowlist-only: git-ai config set allow_repositories "*github.com/acme/allowed*")

# 4. Drive an AI session in that repo (any agent that streams transcripts, e.g. claude),
#    then trigger streaming via a checkpoint + commit.

# 5. Confirm NO session events were emitted for the excluded repo.
#    With GIT_AI_DEBUG=1 the daemon logs the skip:
#    "skipping session events: repository excluded or not in allow_repositories"
GIT_AI_DEBUG=1 git-ai <command>   # observe daemon stderr / logs

# 6. Repeat in a repo that IS allowed (or with no filters) and confirm session events DO flow.

Expected:

  • Excluded repo (or repo not in an active allowlist): no SessionEvent / OtelTrace metrics emitted; watermark not advanced.
  • Allowed repo / no filters configured: session events emitted as before.
  • Allowlist set + Copilot OTEL (no derivable repo): dropped (fail-closed).

🤖 Generated with Claude Code

Repository allow/exclude filters previously only gated checkpoints
(src/commands/git_ai_handlers.rs); session (transcript) MetricEvents were
emitted regardless of the configured repository lists. A customer expected
these filters to also scope session data.

Filter session events in the transcript streaming pipeline, mirroring the
checkpoint-time filter:

- Add `session_repo_allowed(config, work_dir)` and an early bail-out in
  `process_session_blocking`, placed after the work_dir is resolved and
  before any events are emitted or the watermark is advanced. On a skip we
  emit nothing and intentionally do NOT advance the watermark, so a backlog
  re-flows if the filters later change. Reads `Config::fresh()` so a running
  daemon observes config changes without a restart.
- Reuse `Config::is_allowed_repository_with_remotes` so semantics match the
  checkpoint path exactly: exclusions take precedence, an empty allowlist
  allows everything, and an active allowlist fails closed when the repo can't
  be verified (no work_dir, not a git repo, or no remote). Shared streams
  (e.g. Copilot OTEL) never carry a repo URL, so they fall into this
  fail-closed path under an allowlist and pass under exclude-only filters.

Test support:
- Add allow_repositories/exclude_repositories to ConfigPatch and
  apply_test_config_patch, and wire them into the TestRepo home config writer.
- Add Config::with_repository_filters_for_test.
- Add 6 unit tests covering no-filters, allowlist match/miss, exclude
  precedence, and fail-closed/exclude-only unknown-repo behavior.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Siddhant-K-code Siddhant-K-code marked this pull request as draft June 23, 2026 14:45
devin-ai-integration[bot]

This comment was marked as resolved.

@Siddhant-K-code

Collaborator Author

Follow-up on Devin review 4554408177:

Excluded sessions cause repeated expensive I/O on every sweep cycle

Fixed in f51ff781c (fix: avoid repeated filtered transcript sweeps). When a session is filtered out, the worker now updates the stream file metadata (last_known_size / last_modified) but does not advance the event watermark or last_processed_at. That keeps filtered events un-emitted while preventing unchanged excluded transcripts from being rediscovered every sweep cycle.

Added a regression test verifying the metadata update marks the file seen without advancing the watermark.
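
A rough sketch of the idea behind that fix, for readers skimming the thread. The field names last_known_size, last_modified, and last_processed_at come from the description above; the `StreamFileState` struct, the `watermark` field type, and both function names are assumptions for illustration and do not mirror the actual code.

```rust
use std::time::SystemTime;

/// Hypothetical per-file bookkeeping for a transcript stream.
#[derive(Debug, Clone, PartialEq)]
pub struct StreamFileState {
    pub last_known_size: u64,
    pub last_modified: Option<SystemTime>,
    /// Position up to which events have been emitted (type assumed).
    pub watermark: u64,
    pub last_processed_at: Option<SystemTime>,
}

/// On a filter skip: record that the file was seen in its current shape,
/// but leave the event watermark and last_processed_at untouched so that
/// nothing counts as emitted.
pub fn mark_seen_without_advancing(state: &mut StreamFileState, size: u64, modified: SystemTime) {
    state.last_known_size = size;
    state.last_modified = Some(modified);
}

/// Sweep-side check: an unchanged file is not rediscovered.
pub fn needs_rescan(state: &StreamFileState, size: u64, modified: SystemTime) -> bool {
    state.last_known_size != size || state.last_modified != Some(modified)
}
```

With this split, an excluded transcript that has not changed on disk is skipped cheaply on later sweeps, while its backlog stays available because the watermark never moved.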

Shared streams are unconditionally dropped under an active allowlist

This behavior is intentional and security-correct: shared streams like Copilot OTEL do not carry enough repo context to verify against allow_repositories, so they must fail closed. I tightened this path so allowlist-blocked shared streams are skipped during sweep discovery instead of being enqueued and then dropped by the worker. Exclude-only configs still allow unknown/shared streams through, matching the existing filter semantics.

Also fixed two unrelated blockers from a newer Clippy on this branch (a collapsible match guard in src/daemon.rs and a ? rewrite in the integration fuzzer helper) so lint is clean under the available toolchain.

Verification run locally:

~/.cargo/bin/cargo fmt
~/.cargo/bin/cargo test metadata_update_marks_file_seen_without_advancing_watermark --lib -- --test-threads 1
~/.cargo/bin/cargo test session_repo_allowed --lib -- --test-threads 1
~/.cargo/bin/cargo clippy --all-targets -- -D warnings

Pushed to git-ai-project:fix/transcript-repo-filter.

@Siddhant-K-code Siddhant-K-code marked this pull request as ready for review June 23, 2026 15:29
devin-ai-integration[bot]

This comment was marked as resolved.

@Siddhant-K-code

Collaborator Author

Final follow-up on the review comments and CI:

  • Addressed the filtered-session sweep churn without advancing the event watermark.
  • Added the repository-filter fingerprint path so skipped inactive streams are reprocessed when allow_repositories / exclude_repositories changes later.
  • Kept the shared-stream allowlist behavior fail-closed as discussed; unknown-repo shared streams are not allowed through an active allowlist.
  • Devin's follow-up marked the config-change concern resolved.
  • Re-checked review and PR discussion comments after CI finished; no new comments were present.
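
One way to picture the repository-filter fingerprint mentioned above is a hash over the two filter lists, stored alongside a skipped stream and compared on later sweeps. This is a sketch under that assumption only: `filter_fingerprint` and `should_reprocess` are hypothetical names, and the PR may compute or store the fingerprint differently.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash the configured filter lists into a single value.
/// Slices hash their length first, so moving a pattern from one list to
/// the other changes the input to the hasher.
/// Note: DefaultHasher output is not guaranteed stable across Rust
/// releases, so a fingerprint persisted to disk would want a fixed algorithm.
pub fn filter_fingerprint(allow: &[String], exclude: &[String]) -> u64 {
    let mut hasher = DefaultHasher::new();
    allow.hash(&mut hasher);
    exclude.hash(&mut hasher);
    hasher.finish()
}

/// A stream that was skipped under `skipped_under` is worth reprocessing
/// only if the filters have changed since. `None` means it was never
/// skipped by the filter, so this check does not apply.
pub fn should_reprocess(skipped_under: Option<u64>, current: u64) -> bool {
    matches!(skipped_under, Some(fp) if fp != current)
}
```

Comparing fingerprints rather than re-evaluating every skipped stream keeps the common case (filters unchanged) cheap.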

Validation:

cargo check --tests
cargo fmt -- --check
cargo clippy --all-targets -- -D warnings
cargo test --lib session_repo_allowed -- --test-threads 12
cargo test --lib filter_skip -- --test-threads 12
GIT_AI_TEST_SHARED_DAEMON_POOL_SIZE=12 cargo test -- --test-threads 12

CI is green for the visible PR checks, including Ubuntu, macOS, and Windows test jobs.

Development

Successfully merging this pull request may close these issues.

Add session events to allow/exclude repositories filter