This benchmark measures whether semantic lifting improves search localization accuracy in RPG-Encoder.
Two metrics are reported:
- Acc@k: does the correct file appear in the top-k search results for a natural-language intent query?
- MRR (Mean Reciprocal Rank): average of 1/rank of the first correct result across all queries; higher is better.
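For concreteness, here is a minimal sketch of both metrics, assuming a hit is counted whenever any expected substring appears in a result's file path (matching the query format shown below); the script's exact matching logic may differ:

```python
def first_hit_rank(results, expected):
    """1-based rank of the first result whose path contains any expected substring."""
    for rank, path in enumerate(results, start=1):
        if any(sub in path for sub in expected):
            return rank
    return None  # miss


def acc_at_k(ranks, k):
    """Fraction of queries whose first hit lands within the top k."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def mrr(ranks):
    """Mean reciprocal rank; misses contribute 0."""
    return sum(1 / r for r in ranks if r is not None) / len(ranks)
```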
Two search modes are compared:
- Unlifted (`--mode snippets`): keyword/snippet matching only (structural graph)
- Lifted (`--mode auto`): semantic features + keyword matching (merged scores)
The benchmark uses the rpg-encoder repository itself as the test target:
| Repo | Language | Entities | Queries |
|---|---|---|---|
| rpg-encoder | Rust | 855 | 39 |
Each query is a natural-language intent with expected file path substrings:
```json
{
  "query": "extract semantic features from code with LLM",
  "expect": ["semantic_lifting.rs"]
}
```

```bash
# Prerequisites
cargo build --release  # Build rpg-encoder

# Re-run measurement only (fast, uses cached graphs)
python3 benchmarks/search_quality.py --measure-only

# Full benchmark with lifting (uses connected coding agent or API key)
python3 benchmarks/search_quality.py

# Force re-lift all entities
python3 benchmarks/search_quality.py --force-lift
```

855/855 entities lifted (100% coverage).
| Metric | Unlifted | Lifted | Delta |
|---|---|---|---|
| Acc@1 | 13/39 (33%) | 19/39 (49%) | +15% |
| Acc@3 | 19/39 (49%) | 26/39 (67%) | +18% |
| Acc@5 | 19/39 (49%) | 27/39 (69%) | +21% |
| Acc@10 | 20/39 (51%) | 33/39 (85%) | +33% |
| MRR | 0.409 | 0.589 | +0.181 |
MRR delta: +0.181 (95% CI [+0.012, +0.356])
Lifting improves Acc@1 by 15 percentage points, Acc@5 by 21, and Acc@10 by 33. The MRR improvement is statistically significant (the 95% CI does not cross zero).
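The CI computation is not spelled out in this document; one standard approach, sketched here under that assumption, is a paired bootstrap over per-query reciprocal ranks:

```python
import random


def bootstrap_mrr_delta_ci(rr_unlifted, rr_lifted, iters=10_000, seed=0):
    """Paired bootstrap 95% CI for the MRR difference.

    rr_unlifted / rr_lifted: per-query reciprocal ranks (0.0 for misses),
    aligned so index i refers to the same query in both lists.
    """
    rng = random.Random(seed)
    n = len(rr_unlifted)
    deltas = []
    for _ in range(iters):
        idx = [rng.randrange(n) for _ in range(n)]  # resample queries with replacement
        deltas.append(sum(rr_lifted[i] - rr_unlifted[i] for i in idx) / n)
    deltas.sort()
    return deltas[int(0.025 * iters)], deltas[int(0.975 * iters)]
```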
Note: These results use lexical-only search via the CLI binary. The MCP server uses hybrid embedding + lexical search (BGE-small-en-v1.5, 0.6 semantic + 0.4 lexical blending), which would produce even higher accuracy. The CLI does not yet enable the `embeddings` feature; see Feature gap below.
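For illustration, a minimal sketch of that 0.6/0.4 blending, assuming both retrievers' scores are already normalized to [0, 1] and that a document missing from one list scores 0 there (the server's exact normalization is not documented here):

```python
def blend_scores(semantic, lexical, alpha=0.6):
    """Blend per-document scores: alpha * semantic + (1 - alpha) * lexical.

    semantic / lexical: dicts mapping doc id -> normalized score in [0, 1].
    A doc matched by only one retriever contributes 0 from the other.
    """
    docs = set(semantic) | set(lexical)
    return {
        d: alpha * semantic.get(d, 0.0) + (1 - alpha) * lexical.get(d, 0.0)
        for d in docs
    }
```

Top-k results are then taken from the blended scores in descending order.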
Notable per-query improvements with lifting (11 of 20 shown):
- "build token-aware entity batches": @10 -> @1
- "parse Rust functions and structs": @3 -> @1
- "detect file changes from git diff": @2 -> @1
- "incremental update from code modifications": @2 -> @1
- "serialize output in TOON format": miss -> @1
- "configure batch size and encoding settings": miss -> @1
- "parse pipe-delimited line format features": miss -> @1
- "resolve scope specification to entity IDs": miss -> @1
- "propagate dependency features bottom-up": miss -> @1
- "format search results as TOON output": miss -> @1
- "strip LLM think blocks from response": miss -> @1
The benchmark uses the CLI binary (`rpg-encoder search`), which performs only lexical keyword matching. The MCP server (`rpg-mcp-server`) additionally uses fastembed (BGE-small-en-v1.5) for hybrid embedding + lexical search with 0.6/0.4 blending.
This gap exists because:
- `rpg-cli/Cargo.toml` depends on `rpg-nav` without the `embeddings` feature
- `rpg-mcp/Cargo.toml` depends on `rpg-nav` with `features = ["embeddings"]`
- The CLI's `cmd_search` passes `embedding_scores: None` to `search_with_params`
The benchmark therefore measures a lower bound on lifted search quality. MCP users get hybrid search automatically.
The benchmark has two phases:
- PREPARE (slow, cached): copy the repo, build the graph, lift entities. Results are cached in `/tmp/rpg-bench/rpg-encoder/.rpg/`.
- MEASURE (fast, reproducible): run search queries against the cached graphs, compute Acc@k and MRR.

This separation means you only pay the lifting cost once. Subsequent runs with `--measure-only` complete in seconds.
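The split amounts to a guard around the expensive phase. The sketch below uses a hypothetical `prepare` callback, since the script's internals are not shown in this document:

```python
from pathlib import Path
from typing import Callable

CACHE = Path("/tmp/rpg-bench/rpg-encoder")


def ensure_prepared(prepare: Callable[[Path], None], force_lift: bool = False) -> None:
    """PREPARE phase guard: run the expensive copy/build/lift pipeline only
    when no cached graph exists (or when a re-lift is forced)."""
    if force_lift or not (CACHE / ".rpg").exists():
        prepare(CACHE)  # expensive: copy repo, build graph, lift all entities
    # MEASURE then always reads the cached graphs under CACHE/.rpg/,
    # which is why --measure-only runs complete in seconds.
```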
```bash
# Clean slate
rm -rf /tmp/rpg-bench

# Full reproducible run
python3 benchmarks/search_quality.py 2>&1 | tee benchmarks/run.log

# Results saved to benchmarks/results.json
```