An episodic-memory diagnostic for LLM context systems. Point it at your own memory/retrieval stack and see where it forgets a source, misses a correction, or loses the thread. Report and leaderboard-free results live at tulv.ing.
T-bench is an X-ray, not a photo-finish. It is built to diagnose systems — including ours — not to crown a winner. The dataset, the harness, and the report are all in this one repository. Fork it and run it on your system.
Conflict of interest, stated up front: T-bench is built by Compresh. Our own systems appear under the same conditions as every other arm, and we publish a pre-registered hypothesis about our newest system that failed (see Honesty). Hiding that would defeat the point.
The value of a context architecture is inversely proportional to the strength of the answering model.
- On a strong model, retrieval matches full-context quality at ~99% fewer tokens — the win is cost, not accuracy.
- On a small open model, the same retrieval lifts accuracy by ~22 points.
- On the cheapest models, the raw long history does not even fit — retrieval is the only way the task runs at all.
| Arm | Overall | F1 src | F2 corr | F3 time | F4 bind | F5 calib | ctx (tok) | saving |
|---|---|---|---|---|---|---|---|---|
| raw (control) | 0.955 | 1.000 | 1.000 | 0.828 | 0.932 | 0.983 | 31,947 | — |
| naive RAG (baseline) | 0.900 | 0.964 | 1.000 | 0.667 | 0.792 | 0.975 | 270 | 99.2% |
| oracle (upper bound) | 0.985 | 0.951 | 1.000 | 0.984 | 0.990 | 1.000 | 35 | 99.9% |
| blind-summarization | 0.991 | 0.991 | 1.000 | 1.000 | 0.958 | 0.988 | 38,061 | −19.1% |
| retrieval (TUL 2.0) | 0.988 | 0.973 | 1.000 | 1.000 | 0.969 | 0.988 | 275 | 99.1% |
| retrieval + role/anchor (TUL 2.1) | 0.990 | 0.982 | 1.000 | 1.000 | 0.969 | 0.988 | 297 | 99.1% |
raw quality falls with length (S 0.995 → L 0.917); retrieval stays flat (~0.99).
At retrieval's budget (~270 tok), naive RAG scores only 0.900 — below raw — losing
on the temporal and binding families (F3 0.667, F4 0.792).
Full analysis, the weak-model curve, and the methodology: tulv.ing.
git clone https://github.com/compresh/tbench
cd tbench/runner
pip install -r requirements.txt
python run_tbench.py --data ../data/v1.1 --arm your_system --out results/you.jsonlScoring is deterministic (an LLM judge is used for one free-text sub-probe only), so your numbers are reproducible. Then:
- Share + discuss what you found in Discussions.
- Contribute results by opening a PR that adds your
.jsonltoresults/(see CONTRIBUTING). This is collaborative, not competitive — we want to see where your system's X-ray differs from ours.
T-bench is a living benchmark — we keep adding tests, models, and harder probes over time, and we are glad to publish results, ours and the community's.
Functional probes inspired by Tulving's episodic-memory framework (not a claim that any system "satisfies Tulving's criteria" — we do not measure autonoesis):
| Family | Probes |
|---|---|
| F1 — source attribution | who said it: user, assistant, or third party; which turn |
| F2 — correction fidelity | the current value and recall of the superseded one |
| F3 — relational time | ordering, dangling-reference resolution |
| F4 — what-where-when binding | episode individuation, co-occurrence, counting |
| F5 — remember/know | calibration: hearsay flags, abstention, false-memory rejection |
- Pre-registered hypotheses and thresholds, written before each run in an append-only registry. Thresholds are never moved after seeing results, and negative results are published.
- Deterministic scoring from a generating graph — reproducibility does not depend on a judge model.
- Synthetic, graph-based generation → contamination is controllable (public split + private held-out split, regenerable).
A pre-registered hypothesis (H1) predicted our newest system would beat plain retrieval by ≥ +5pp on source attribution. Measured: +0.9pp. Rejected. The honest read is that role/anchor rendering is a net-positive, zero-cost change — decisive on weak models, marginal on strong ones — not the across-the-board win we guessed. T-bench has also caught several of our own bugs (a mis-firing temporal anchor; two silent fallbacks) before they shipped. An instrument that never catches its builder is not an instrument.
tbench/
├── docs/ the tulv.ing report page (GitHub Pages, served from /docs)
├── data/ T-bench dataset (versioned; public + private held-out)
├── runner/ the harness — pip install + one command
├── results/ our results + community-contributed results (via PR)
└── CONTRIBUTING.md
Code: MIT (see LICENSE), aligned with other Compresh repos. Dataset: published
as measured, CC BY-SA 4.0 with attribution — it includes StackExchange-derived
distractor text. See ATTRIBUTION.md and data/README.md.
Built by Compresh. Report: tulv.ing. We are early in this area too — if T-bench tells you something we got wrong, that is the instrument working.