The skills themselves are validated using the `skills-validate.yml` workflow.

### Automated Skill Evaluations (EvalBench)

This repository uses the [EvalBench framework](https://github.com/GoogleCloudPlatform/evalbench) to automatically evaluate the quality, multi-turn conversational capabilities, and skill execution of the extension.

Evaluations run automatically via Cloud Build (`cloudbuild.yaml`) on pull requests when the `ci:run-evals` or `autorelease: pending` label is applied. Because tests run against a live Cloud SQL instance, credentials are injected securely from Secret Manager during CI.
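
For orientation, the snippet below sketches how Secret Manager injection typically looks in a Cloud Build config. It is not the repository's actual `cloudbuild.yaml`: the builder image, the eval command, and the secret and environment variable names are placeholder assumptions.

```yaml
steps:
  # Run the evaluation suite; the image and command here are placeholders.
  - name: 'python:3.12'
    entrypoint: 'bash'
    args: ['-c', 'pip install -r evals/requirements.txt && python evals/run_evals.py --config evals/run_config.yaml']
    secretEnv: ['DB_PASSWORD']  # Exposed to this step as an environment variable.

# Pull the database credential from Secret Manager at build time.
availableSecrets:
  secretManager:
    - versionName: 'projects/$PROJECT_ID/secrets/cloudsql-db-password/versions/latest'
      env: 'DB_PASSWORD'
```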
#### Understanding Evaluation Files

All evaluation configurations and datasets are located in the [`evals/`](evals/) directory:

* **Conversational Dataset (`dataset.json`):** Defines the test scenarios for the model (see the sketch after this list). Each scenario contains:
  * `starting_prompt`: The initial prompt sent to the agent.
  * `conversation_plan`: Instructions for the simulated user LLM to drive multi-turn interactions.
  * `expected_trajectory`: The sequence of tool/skill calls expected to complete the task successfully.
* **Run Configuration (`run_config.yaml`):** Configures the EvalBench orchestrator, the target model configs, and the qualitative/performance scorers (e.g., goal completion, behavioral metrics, latency, token consumption).
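
As a concrete reference, here is a minimal sketch of a single scenario entry. The four field names come from this document, but the value formats, the tool names, and the overall file structure are assumptions; use the existing entries in `evals/dataset.json` as the authoritative template.

```json
{
  "id": "list-tables-basic",
  "starting_prompt": "What tables exist in my database?",
  "conversation_plan": "Act as a developer exploring an unfamiliar project. After the agent lists the tables, ask for the schema of one of them, then end the conversation.",
  "expected_trajectory": ["list_tables", "get_table_schema"]
}
```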

#### Maintaining and Adding Scenarios

When adding new skills or modifying existing behavior, add or update the corresponding scenarios in the dataset file:

1. Open `evals/dataset.json`.
2. Add a new scenario block with a unique `id`, a clear `starting_prompt`, a detailed `conversation_plan`, and the `expected_trajectory` of tool calls.
3. Apply the `ci:run-evals` label when creating your pull request to trigger the evaluation pipeline (see the command sketch after this list).
4. The evaluation pipeline runs securely via Cloud Build. A maintainer will review the internal logs and results to verify that your scenarios pass.
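
If you use the GitHub CLI, the label from step 3 can also be applied from the terminal. This assumes `gh` is installed and authenticated; `<pr-number>` is a placeholder for your pull request's number.

```sh
# Apply the label that triggers the evaluation pipeline on an open PR.
gh pr edit <pr-number> --add-label "ci:run-evals"
```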

### Other GitHub Checks

* **License Header Check:** A workflow ensures all necessary files contain the