Speed up `git-ai stats` range mode by batching authorship-note reads by jwiegley · Pull Request #1652 · git-ai-project/git-ai

jwiegley · 2026-06-24T23:34:27Z

Summary

Makes `git-ai stats <a>..<b>` (range mode) faster, from ~1.7× on a 100-commit range up to ~2.5× on a 1000-commit range, by removing redundant git-subprocess fan-out in the blame path. `range_stats` output is byte-identical.

Profiling (release build, real repo) showed the cost is git subprocesses, not SQLite — the default GitNotes backend touches 0 ms of SQLite on this path, so a #1630-style index would not help git-ai stats. The range path blames every changed file three times (diff_ai_accepted_stats + start/end VirtualAttributions), and each blame re-read every blamed commit's authorship note with a separate git notes --ref=ai show — thousands of per-commit note spawns, uncached across files and passes.

Changes (both output-preserving)

Batch the note reads. New notes_api::read_authorship_v3_batch resolves many commits' notes with one git ls-tree + cat-file --batch (via refs::notes_for_commits) instead of one git notes show per commit. It returns exactly what read_authorship_v3(repo, sha).ok() yields per commit — the v3 parse is factored into refs::parse_reference_as_authorship_log_v3 and shared by both paths, and the Http backend delegates per commit to preserve its cache-hit semantics. Both blame overlays (overlay_ai_authorship, populate_ai_human_authors) pre-seed their existing per-commit cache from one batched read.
Skip dead work on the stats path. blame() with no_output discards the blame hunks, so the populate_ai_human_authors pass that annotates them with ai_human_author (a second per-commit note read) is wasted there. Adds GitAiBlameOptions::skip_human_author_population (default false, debug-asserted to only be set on no_output paths) and sets it on the two stats blame call sites. It is output-invariant because overlay_ai_authorship derives line_authors per line independently of hunk grouping.
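As a rough illustration of the batching idea (not the PR's actual code), the sketch below parses the stdout of a single `git cat-file --batch` call and uses it to pre-seed a per-commit cache. The function names `parse_cat_file_batch` and `preseed_cache` are hypothetical, and the sketch assumes the caller has already mapped each commit to the name it fed to `cat-file` (for example via `git ls-tree` on the notes ref), in the same order as the `commits` slice.

```rust
use std::collections::HashMap;

/// Parse the stdout of `git cat-file --batch`.
/// A found object is reported as `<oid> <type> <size>\n<bytes>\n`,
/// a missing one as `<name> missing\n`.
/// Returns (name-or-oid, Some(bytes)) or (name, None), in output order.
fn parse_cat_file_batch(mut out: &[u8]) -> Vec<(String, Option<Vec<u8>>)> {
    let mut results = Vec::new();
    while !out.is_empty() {
        // The header line ends at the first newline.
        let Some(nl) = out.iter().position(|&b| b == b'\n') else { break };
        let header = String::from_utf8_lossy(&out[..nl]).into_owned();
        out = &out[nl + 1..];
        let fields: Vec<&str> = header.split(' ').collect();
        match fields.as_slice() {
            [name, "missing"] => results.push((name.to_string(), None)),
            [oid, _kind, size] => {
                let Ok(size) = size.parse::<usize>() else { break };
                if out.len() < size {
                    break; // truncated output: stop rather than guess
                }
                results.push((oid.to_string(), Some(out[..size].to_vec())));
                // Skip the object bytes plus the trailing newline.
                out = &out[(size + 1).min(out.len())..];
            }
            _ => break, // unexpected header shape
        }
    }
    results
}

/// Pre-seed a per-commit cache from one batched read.
/// Assumption: `batch[i]` is the answer for `commits[i]`
/// (`cat-file --batch` replies in request order).
fn preseed_cache(
    commits: &[&str],
    batch: Vec<(String, Option<Vec<u8>>)>,
) -> HashMap<String, Option<String>> {
    commits
        .iter()
        .zip(batch)
        .map(|(commit, (_name, body))| {
            // Lossy decode, mirroring the batched path described in the PR.
            let note = body.map(|b| String::from_utf8_lossy(&b).into_owned());
            (commit.to_string(), note)
        })
        .collect()
}
```

With a cache seeded this way, later per-commit lookups during blame hit the map instead of spawning a subprocess; commits without a note are cached as `None`, so they are not retried either.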

Performance (release build, real repo, same-run A/B, `range_stats` byte-identical)

The speedup grows with range size because it eliminates per-commit note fan-out, which scales with the number of blamed commits:

| command | baseline (same run) | optimized | speedup |
| --- | --- | --- | --- |
| `git-ai stats HEAD~100..HEAD` | 69.0 s (noisy: 69–93 s across runs) | 40.0 s | ~1.7× |
| `git-ai stats HEAD~1000..HEAD` | 272.3 s | 110.2 s | ~2.5× |

Under a subprocess-counting harness on HEAD~100..HEAD, per-commit git notes show collapses from ~7,200 to ~50 (the remainder shifts into batched cat-file/ls-tree); total git spawns ~14.5k → ~9.3k same-harness. (Exact spawn totals are method-sensitive; the notes show collapse and the wall-time deltas are the robust figures.)
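For readers who want to re-derive the rounded figures, here is a small sketch of the arithmetic. The helper names `speedup` and `reduction` are illustrative only, and the inputs are the approximate numbers quoted above (the spawn totals in particular are method-sensitive).

```rust
/// Speedup factor: baseline wall time divided by optimized wall time.
fn speedup(baseline_s: f64, optimized_s: f64) -> f64 {
    baseline_s / optimized_s
}

/// Fraction of the original count (or time) that was eliminated.
fn reduction(before: f64, after: f64) -> f64 {
    1.0 - after / before
}

/// The figures quoted in this PR, recomputed:
/// (100-commit speedup, 1000-commit speedup,
///  `git notes show` reduction, total git spawn reduction).
fn quoted_figures() -> (f64, f64, f64, f64) {
    (
        speedup(69.0, 40.0),         // 1.725, reported as ~1.7x
        speedup(272.3, 110.2),       // about 2.47, reported as ~2.5x
        reduction(7_200.0, 50.0),    // about 0.993 of note spawns removed
        reduction(14_500.0, 9_300.0) // about 0.36 of all git spawns removed
    )
}
```

The gap between the last two numbers reflects the point made above: most `git notes show` spawns disappear, while the total shrinks less because part of that work moves into batched `cat-file`/`ls-tree` calls and other git subprocesses remain.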

Validation

task fmt, task lint (clippy -D warnings), full task test — green.
Same-run A/B byte-identical check of range_stats on real data across 100- and 1000-commit ranges (range_1000 = 120,390 AI lines across 32 tool/model combos — identical before/after).
Regression tests: batched read ≡ per-commit read (commits with notes, without notes, non-existent); empty-input fast path; and skip-dead-pass gate output-invariance (blame line authors identical with the pass on vs off).
Two independent adversarial reviews (the batching, and the skip-dead-pass gate) — no blockers. Documented parity boundary: on GitNotes the batched read decodes note blobs with from_utf8_lossy (matching the existing CommitAuthorship path) vs the per-commit strict decode — unreachable for real notes (always valid UTF-8 JSON). The Http and batch-error fallback arms delegate per commit to the existing reader (equivalent by construction; not exercised by tests).
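The parity boundary mentioned above can be illustrated with the standard library alone. This is a sketch, not the project's decoding code: `decode_strict`, `decode_lossy`, and `decodes_agree` are hypothetical helpers that only show where `String::from_utf8` and `String::from_utf8_lossy` diverge.

```rust
/// Strict decode: returns None if the bytes are not valid UTF-8.
fn decode_strict(bytes: &[u8]) -> Option<String> {
    String::from_utf8(bytes.to_vec()).ok()
}

/// Lossy decode: invalid sequences are replaced with U+FFFD.
fn decode_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

/// True when both decoders produce the same text, which holds
/// exactly when the input is valid UTF-8.
fn decodes_agree(bytes: &[u8]) -> bool {
    decode_strict(bytes).as_deref() == Some(decode_lossy(bytes).as_str())
}
```

For any note blob that is valid UTF-8 JSON the two decoders return the same string, so the paths can only differ on invalid bytes, where the strict decode fails and the lossy decode substitutes replacement characters.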

Note

The #1630 index pattern still applies — just to a different command, git-ai usage (MetricsDatabase::get_metric_history is a full-table scan that deserializes every row). Not touched here.

🤖 Generated with Claude Code

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

`git-ai stats <a>..<b>` was dominated by git-subprocess fan-out, not by any database work: the range path blames every changed file three times (diff_ai_accepted_stats + start/end VirtualAttributions), and each blame re-read every blamed commit's authorship note with a separate `git notes --ref=ai show` subprocess, uncached across files and passes. On a 100-commit range over this repo that is several thousand `git notes show` calls alone, the large majority of the total git subprocesses. (Profiling confirmed the cost is git subprocesses, not SQLite: the default GitNotes backend touches 0ms of SQLite on this path, so a #1630-style index PR would not help `git-ai stats`.)

Two changes, both output-preserving:

1. Batch the note reads. Add `notes_api::read_authorship_v3_batch`, which resolves many commits' notes with one `git ls-tree` + `cat-file --batch` (via `refs::notes_for_commits`) instead of one `git notes show` per commit. It returns exactly what `read_authorship_v3(repo, sha).ok()` yields for each commit; the v3 parse is factored into `refs::parse_reference_as_authorship_log_v3` and shared by both paths, and the Http backend delegates per commit to keep its cache-hit semantics. Both blame overlays (`overlay_ai_authorship` and `populate_ai_human_authors`) pre-seed their existing per-commit cache from one batched read.

2. Skip dead work on the stats path. `blame()` with `no_output` discards the blame hunks, so the `populate_ai_human_authors` pass that annotates them with `ai_human_author` (a second per-commit note read) is wasted there. Add `GitAiBlameOptions::skip_human_author_population` (default false, debug-asserted to only be set on no_output paths) and set it on the two stats blame call sites. It is output-invariant because `overlay_ai_authorship` derives `line_authors` per line independently of hunk grouping/`ai_human_author`.

Measured on the real repo (release build), same-run A/B (median of repeated runs, warm-up discarded), `range_stats` byte-identical. The speedup grows with range size because it eliminates per-commit note fan-out, which scales with the number of blamed commits. On `stats HEAD~100..HEAD` the per-commit `git notes show` calls collapse from ~7,200 to ~50 (the remainder shifts into a few batched `cat-file`/`ls-tree` reads); wall time:

- stats HEAD~100..HEAD: 69.0s -> 40.0s (~1.7x; baseline noisy, 69-93s across runs)
- stats HEAD~1000..HEAD: 272.3s -> 110.2s (~2.5x)

Adds regression tests: the batched read equals the per-commit read for commits with notes, without notes, and non-existent commits; the empty-input fast path; and the skip-dead-pass gate is output-invariant (blame line authors identical with the pass on vs off). (Tests use the GitNotes backend; the Http delegation and batch-error fallback paths are correct by construction but not exercised.)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01QYmBGrGCfKDY8NLz17TqxJ

jwiegley · 2026-07-01T17:34:13Z

Speed up git-ai stats range mode by batching authorship-note reads #1652 👈 (View in Graphite)
main

This stack of pull requests is managed by Graphite. Learn more about stacking.

devin-ai-integration (bot) reviewed on Jun 24, 2026

jwiegley force-pushed the johnw/faster-stats branch from a601a01 to fb5c37a on June 25, 2026 04:30

jwiegley requested a review from svarlamov on June 28, 2026 07:10

jwiegley force-pushed the johnw/faster-stats branch from 82ebba2 to 1160141 on July 1, 2026 17:33

jwiegley wants to merge 1 commit into main from johnw/faster-stats
