This document summarizes the impact of the using-braintrust skill on Claude's ability to use the Braintrust platform.
We created two end-to-end evaluations that test Claude's ability to:
- Log Fetch: Log data to Braintrust and verify it exists via BTQL queries
- Experiment: Create and run evaluations with scorers
Both evals use the Claude Agent SDK to have Claude actually execute Python code, then verify the results by querying Braintrust directly (not trusting Claude's text output).
| Eval | Metric | Before Skill | After Skill | Improvement |
|---|---|---|---|---|
| Log Fetch | Logs Created | 67% | 100% | +33% |
| Log Fetch | Correct Count | 67% | 100% | +33% |
| Log Fetch | Task Completed | 67% | 100% | +33% |
| Experiment | Experiments Created | 100% | 100% | — |
| Experiment | Eval Ran | 100% | 100% | — |
| Experiment | Task Completed | 0% | 100% | +100% |
Without the skill, Claude consistently made this error:
```python
# What Claude tried (WRONG)
braintrust.Eval(project_name="My Project", data=..., task=..., scores=...)
# Error: TypeError: Eval() got an unexpected keyword argument 'project_name'
```

The skill documentation explicitly shows the correct usage:
```python
# Correct usage from SKILL.md
braintrust.Eval(
    "My Project",  # Project name is FIRST POSITIONAL argument
    data=lambda: [...],
    task=lambda input: ...,
    scores=[my_scorer],
)
```

The Log Fetch eval tests Claude's ability to:
- Initialize a Braintrust logger
- Log entries with metadata
- Call `flush()` to ensure data is sent (see the sketch after this list)
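A minimal sketch of those three steps (not the eval's own code), assuming the logger API that SKILL.md documents; the project name and `test_id` value here are illustrative:

```python
import braintrust

# Initialize a logger for the test project (project name is illustrative).
logger = braintrust.init_logger(project="Braintrust Skill - E2E Log Fetch")

# Log a few entries, tagging each with a test_id so the BTQL check can find them.
for i in range(3):
    logger.log(
        input=f"question {i}",
        output=f"answer {i}",
        metadata={"test_id": "example-run-001"},
    )

# flush() ensures buffered records are actually sent before the process exits.
logger.flush()
```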
Baseline (without skill):
- Logs Created: 67% — Claude sometimes forgot `flush()` or used the wrong API
- Correct Count: 67% — Not always logging the expected number of entries
- Task Completed: 67% — Some executions had errors
With skill:
- Logs Created: 100% ✅
- Correct Count: 100% ✅
- Task Completed: 100% ✅
The Experiment eval tests Claude's ability to:
- Create test data
- Write a task function
- Use autoevals scorers
- Run `braintrust.Eval()` (see the sketch after this list)
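A sketch of what a passing run looks like, assuming the `Levenshtein` scorer from `autoevals`; the project name, data, and task here are illustrative stand-ins, not the eval's actual code:

```python
import braintrust
from autoevals import Levenshtein

# Small in-memory dataset: each case pairs an input with an expected output.
test_data = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "3 + 5", "expected": "8"},
]

def solve(input):
    # Trivial stand-in task; a real eval would call a model here.
    answers = {"2 + 2": "4", "3 + 5": "8"}
    return answers.get(input, "unknown")

braintrust.Eval(
    "Braintrust Skill - E2E Experiment",  # project name as the first positional argument
    data=lambda: test_data,
    task=solve,
    scores=[Levenshtein],
)
```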
Baseline (without skill):
- Experiments Created: 100% — Claude did create experiments
- Eval Ran: 100% — The evals did execute
- Task Completed: 0% — But every execution hit the `project_name` TypeError
With skill:
- Experiments Created: 100% ✅
- Eval Ran: 100% ✅
- Task Completed: 100% ✅
The SKILL.md file includes:
- Correct API signatures — Shows that `Eval()` takes a positional project name
- Working examples — Copy-paste code that actually works
- Common pitfalls — Explicitly warns about the `project_name` mistake
- API reference — Documents `Eval()`, `init_logger()`, `Score`, etc.
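For illustration, a minimal custom scorer of the kind that API reference covers; this assumes the common `(input, output, expected)` signature and returns a plain 0-1 float, while the `Score` type mentioned above can wrap the result with a name and metadata:

```python
def exact_match(input, output, expected):
    # Simplest possible scorer: 1.0 for an exact match, 0.0 otherwise.
    return 1.0 if output == expected else 0.0

# Passed alongside autoevals scorers, e.g.:
# braintrust.Eval("My Project", data=..., task=..., scores=[exact_match])
```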
Both evals verify results by querying Braintrust directly:
- Log Fetch: Uses BTQL to count logs with a specific `test_id` in metadata
- Experiment: Uses the API to count experiments created in the test project
This ensures we're measuring actual success, not just Claude claiming success.
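As a rough sketch of the log-count check: this assumes Braintrust's BTQL HTTP endpoint at `https://api.braintrust.dev/btql`, an API key in `BRAINTRUST_API_KEY`, and an approximate BTQL query shape; none of this is the eval's actual verification code:

```python
import os
import requests

# Approximate BTQL query (syntax is an assumption; see Braintrust's BTQL docs):
# fetch all log rows in the project whose metadata carries our test_id.
query = """
select: *
from: project_logs('<project_id>')
filter: metadata.test_id = 'example-run-001'
"""

resp = requests.post(
    "https://api.braintrust.dev/btql",
    headers={"Authorization": f"Bearer {os.environ['BRAINTRUST_API_KEY']}"},
    json={"query": query},
)
resp.raise_for_status()

# Count matching rows client-side and compare against the expected number of logs.
rows = resp.json().get("data", [])
print("logs found:", len(rows))
```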
The with-skill runs are viewable in Braintrust:
- Log Fetch (with skill): https://www.braintrust.dev/app/braintrustdata.com/p/Braintrust%20Skill%20-%20E2E%20Log%20Fetch/experiments/add-evals-1766272101
- Experiment (with skill): https://www.braintrust.dev/app/braintrustdata.com/p/Braintrust%20Skill%20-%20E2E%20Experiment/experiments/add-evals-1766271997
The skill raised Claude's success rate on critical Braintrust operations to 100%, from baselines of 67% (Log Fetch) and 0% (Experiment task completion). The key insight is that LLMs need explicit documentation of API patterns, especially when:
- The API uses positional arguments in unexpected ways
- There are required cleanup steps (like `flush()`)
- The error messages don't clearly indicate the fix
This validates the evaluation-driven development approach: we identified real gaps by running evals first, then created targeted documentation to address those specific failures.