Merged
1 change: 1 addition & 0 deletions .github/workflows/dash_evals_module_tests.yml
@@ -35,6 +35,7 @@ jobs:
run: |
source .venv/bin/activate
pip install --upgrade pip
pip install -e ../dataset_config_python
pip install -e ".[dev]"

- name: Run tests
1 change: 1 addition & 0 deletions .github/workflows/docs.yml
@@ -32,6 +32,7 @@ jobs:
run: |
pip install --upgrade pip
pip install -r docs/requirements.txt
pip install -e packages/dataset_config_python
pip install -e packages/dash_evals

- name: Install Dart dependencies
1 change: 1 addition & 0 deletions README.md
@@ -11,6 +11,7 @@ This repo includes

- **eval runner** — Python package for running LLM evaluations with configurable tasks, variants, and models
- **config packages** — Dart and Python packages that resolve dataset YAML into EvalSet JSON for the runner
- **NB**: These packages largely overlap and coexist for backwards compatibility; in time, the Dart package will be deprecated.
- **devals CLI** — Dart CLI for creating and managing dataset samples, tasks, and jobs
- **Evaluation Explorer** — Dart/Flutter app for browsing and analyzing results

7 changes: 4 additions & 3 deletions docs/contributing/packages/dash_evals.md
@@ -41,9 +41,10 @@ src/dash_evals/

1. **Configure**: The Dart `dataset_config_dart` package parses dataset YAML and resolves it into an `EvalSet` JSON manifest
2. **Load**: The Python runner reads the JSON manifest via `json_runner.py`, resolving task functions dynamically with `importlib`
3. **Execute**: Each task function receives its dataset and task definition, producing an `inspect_ai.Task`
4. **Score**: Scorers evaluate model outputs against targets
5. **Log**: Results written to the configured `log_dir`
3. **Hydrate**: Config dicts are converted to Inspect AI objects (datasets, MCP servers, skills) using shared helpers from `dataset_config_python.hydrate`
4. **Execute**: Each task function receives its dataset and task definition, producing an `inspect_ai.Task`
5. **Score**: Scorers evaluate model outputs against targets
6. **Log**: Results written to the configured `log_dir`

Alternatively, the runner can be invoked directly with `--task` and `--model` arguments (via `args_runner.py`), bypassing the Dart config pipeline.

2 changes: 1 addition & 1 deletion docs/contributing/repository_structure.md
@@ -10,7 +10,7 @@ evals/
│ ├── devals_cli/ # Dart CLI for managing dataset (devals)
│ ├── dataset_config_dart/ # Dart library: YAML → EvalSet JSON
│ ├── dash_evals/ # Python evaluation runner
│ ├── dataset_config_python/ # Python configuration models
│ ├── dataset_config_python/ # Python config: YAML → EvalSet JSON + config → Inspect AI objects
│ └── eval_explorer/ # Dart/Flutter results viewer (Serverpod)
├── tool/ # Utility scripts
├── pubspec.yaml # Dart workspace configuration
24 changes: 14 additions & 10 deletions docs/guides/about_the_framework.md
@@ -18,6 +18,7 @@ YAML config → Dart resolver → JSON manifest → Python runner → Inspect AI
|-------|---------|-------------|
| **YAML config** | — | Your `task.yaml` and `job.yaml` files |
| **Dart resolver** | `dataset_config_dart` | Parses YAML, resolves globs and references, produces a JSON manifest |
| **Hydration** | `dataset_config_python` | Converts config dicts into Inspect AI objects (datasets, MCP servers, skills) |
| **Python runner** | `dash_evals` | Reads the manifest, builds Inspect AI `Task` objects, calls `eval_set()` |
| **Inspect AI** | `inspect_ai` | Runs solver chains, sends prompts, collects responses, runs scorers |
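To make the hand-off between stages concrete, here is a sketch of what a task entry in the JSON manifest might look like. The exact schema is an assumption inferred from the dataset fields used elsewhere in this repo (`format`, `source`, `args`, inline `samples`), not an authoritative spec.

```python
import json

# Illustrative only: a minimal manifest task entry with inline samples
# (format "memory", the default).
memory_entry = {
    "name": "example_task",
    "dataset": {
        "format": "memory",
        "samples": [
            {"input": "What is 2 + 2?", "target": "4", "id": "s1"},
        ],
    },
}

# A json/csv dataset instead points at a file or URL via "source",
# with extra keyword arguments carried in "args".
csv_entry = {
    "name": "csv_task",
    "dataset": {"format": "csv", "source": "data/samples.csv", "args": {"shuffle": True}},
}

# The Dart resolver hands the Python runner a JSON document, so the
# entries must survive a round-trip through JSON serialization.
manifest = json.loads(json.dumps({"tasks": [memory_entry, csv_entry]}))
print(manifest["tasks"][1]["dataset"]["format"])  # → csv
```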

@@ -148,16 +149,19 @@ calling `submit()`.

## Shared helpers

The `task_helpers.py` module contains functions used across all tasks:

| Helper | What it does |
|--------|-------------|
| `append_context_injection(chain, config)` | Adds a `context_injector` solver if the variant has `files` |
| `append_model_interaction(chain, config)` | Adds `react()` (if tools exist) or `generate()` (if not) |
| `get_skill_tool(config)` | Creates a skill tool if the variant has `skills` configured |
| `build_task_metadata(config)` | Builds the metadata dict for the `Task` object |
| `create_mcp_servers(configs, sandbox_type)` | Creates MCP server objects from variant config |
| `validate_sandbox_tools(config, tool_names)` | Checks that sandbox-requiring tools aren't used on local |
The `task_helpers.py` module contains functions used across all tasks. Some of
these are re-exported from `dataset_config_python.hydrate`, the shared
config-interpretation layer that both `dash_evals` and external consumers (such
as yardstick) use, ensuring config is hydrated into Inspect AI objects
consistently.

| Helper | Source | What it does |
|--------|--------|-------------|
| `create_mcp_servers(configs, sandbox_type)` | `dataset_config_python` | Creates MCP server objects from variant config |
| `get_skill_tool(config)` | `dataset_config_python` | Creates a skill tool if the variant has `skills` configured |
| `build_task_metadata(config)` | `dataset_config_python` | Builds the metadata dict for the `Task` object |
| `append_context_injection(chain, config)` | `dash_evals` | Adds a `context_injector` solver if the variant has `files` |
| `append_model_interaction(chain, config)` | `dash_evals` | Adds `react()` (if tools exist) or `generate()` (if not) |
| `validate_sandbox_tools(config, tool_names)` | `dash_evals` | Checks that sandbox-requiring tools aren't used on local |
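The append-style helpers above can be pictured with a toy sketch. This is illustrative only: the names mirror the table, but the bodies are stand-ins, and the real helpers build chains of Inspect AI solvers rather than strings.

```python
# Toy model of the chain-building pattern: each helper inspects the
# variant config and conditionally appends a "solver" to the chain.

def append_context_injection(chain: list, config: dict) -> None:
    # Real helper adds a context_injector solver when the variant has files.
    if config.get("files"):
        chain.append("context_injector")

def append_model_interaction(chain: list, config: dict) -> None:
    # Real helper adds react() when tools exist, otherwise generate().
    chain.append("react" if config.get("tools") else "generate")

chain: list = []
config = {"files": ["notes.md"], "tools": ["mcp_server"]}
append_context_injection(chain, config)
append_model_interaction(chain, config)
print(chain)  # → ['context_injector', 'react']
```

A variant with no files and no tools would produce just `['generate']`, which matches the fallback described in the table.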

These helpers mean that most of the variant logic (context injection, MCP tools,
skills) is handled **automatically**. You just need to define the core solver
22 changes: 19 additions & 3 deletions docs/reference/dart_api/dataset_config_dart/dataset_config_dart.md
@@ -336,7 +336,7 @@ and [`MemoryDataset`](https://inspect.aisi.org.uk/reference/inspect_ai.dataset.h
#### `Dataset`

```dart
Dataset({List<Sample> samples, String? name, String? location, bool shuffled})
Dataset({List<Sample> samples, String? name, String? location, bool shuffled, String format, String? source, Map<String, dynamic>? args})
```

#### `Dataset.fromJson`
@@ -1007,7 +1007,7 @@ inspect_eval_arguments:
#### `Job`

```dart
Job({String? description, required String logDir, int maxConnections, List<String>? models, Map<String, Map<String, dynamic>>? variants, List<String>? taskPaths, Map<String, JobTask>? tasks, bool saveExamples, Map<String, dynamic>? sandbox, Map<String, dynamic>? inspectEvalArguments, TagFilter? taskFilters, TagFilter? sampleFilters})
Job({String? description, required String logDir, int maxConnections, required List<String> models, Map<String, Map<String, dynamic>>? variants, List<String>? taskPaths, Map<String, JobTask>? tasks, bool saveExamples, Map<String, dynamic>? sandbox, Map<String, dynamic>? inspectEvalArguments, TagFilter? taskFilters, TagFilter? sampleFilters})
```

#### `Job.fromJson`
@@ -1203,7 +1203,7 @@ former `TaskConfig` model-package class.
#### `ParsedTask`

```dart
ParsedTask({required String id, required String func, required List<Sample> samples, required Variant variant, String sandboxType, String? systemMessage, bool saveExamples, String? examplesDir, Map<String, dynamic>? sandboxParameters, Map<String, String>? taskFiles, String? taskSetup, String? model, Map<String, dynamic>? config, Map<String, String>? modelRoles, Object? sandbox, Object? approval, Object? epochs, Object? failOnError, bool? continueOnFail, int? messageLimit, int? tokenLimit, int? timeLimit, int? workingLimit, double? costLimit, Object? earlyStopping, String? displayName, Object? version, Map<String, dynamic>? metadata})
ParsedTask({required String id, required String func, required List<Sample> samples, required Variant variant, String sandboxType, String? systemMessage, bool saveExamples, String? examplesDir, Map<String, dynamic>? sandboxParameters, Map<String, String>? taskFiles, String? taskSetup, String? model, Map<String, dynamic>? config, Map<String, String>? modelRoles, Object? sandbox, Object? approval, Object? epochs, Object? failOnError, bool? continueOnFail, int? messageLimit, int? tokenLimit, int? timeLimit, int? workingLimit, double? costLimit, Object? earlyStopping, String? displayName, Object? version, Map<String, dynamic>? metadata, String datasetFormat, String? datasetSource, Map<String, dynamic>? datasetArgs})
```

### Properties
@@ -1304,6 +1304,18 @@ ParsedTask({required String id, required String func, required List<Sample> samp

Additional metadata to associate with the task.

- **`datasetFormat`** → `String` *(final)*

Dataset format: 'memory' (inline samples), 'json', or 'csv'.

- **`datasetSource`** → `String?` *(final)*

File path or URL for json/csv datasets.

- **`datasetArgs`** → `Map<String, dynamic>?` *(final)*

Extra kwargs passed to `json_dataset()` or `csv_dataset()`.
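A hypothetical `task.yaml` dataset stanza exercising these fields might look like the following. The YAML key names (`format`, `source`, `args`) are assumptions mapped from the properties above, not a verified schema.

```yaml
# Inline samples (datasetFormat 'memory', the default):
dataset:
  samples:
    - input: "What is 2 + 2?"
      target: "4"

# External CSV (datasetFormat 'csv'), with extra kwargs forwarded
# to csv_dataset() via datasetArgs:
# dataset:
#   format: csv
#   source: data/samples.csv
#   args:
#     shuffle: true
```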

### Methods

#### `copyWith`
@@ -1722,6 +1734,10 @@ Job createDefaultJob(String baseDir)

Create a [Job] with default settings (when no job file is provided).

Note: the caller must specify models, as there are no defaults. This function
creates a job with an empty models list; the resolver raises an error if the
list is still empty at resolution time.

**Parameters:**

- `baseDir` (`String`) *(required)*
1 change: 1 addition & 0 deletions packages/dash_evals/pyproject.toml
@@ -15,6 +15,7 @@ dependencies = [
"openai>=2.8.1,<3.0.0",
"firebase-admin>=6.0.0,<8.0.0",
"pydantic>=2.0.0,<3.0.0",
"dataset-config-python",
]

[project.optional-dependencies]
68 changes: 1 addition & 67 deletions packages/dash_evals/src/dash_evals/runner/json_runner.py
@@ -11,7 +11,7 @@
from pathlib import Path

import inspect_ai
from inspect_ai.dataset import MemoryDataset, Sample, csv_dataset, json_dataset
from dataset_config_python.hydrate import build_dataset as _build_dataset

from dash_evals.utils.logging import capture_output, setup_logging

@@ -94,74 +94,8 @@ def _resolve_task_func(name: str):
return func


def _build_dataset(task_def: dict):
"""Build an Inspect AI dataset from a task definition.

Dispatches on ``task_def["dataset"]["format"]``:

- ``"memory"`` (default): builds a ``MemoryDataset`` from inline samples.
- ``"json"``: delegates to ``inspect_ai.dataset.json_dataset(source, **args)``.
- ``"csv"``: delegates to ``inspect_ai.dataset.csv_dataset(source, **args)``.

Args:
task_def: A task entry from the EvalSet JSON manifest.

Returns:
An Inspect AI dataset object.

Raises:
ValueError: If the dataset format is unrecognized or required fields
(e.g. ``source`` for json/csv) are missing.
"""
dataset_def = task_def.get("dataset")
task_name = task_def.get("name", "")

if not dataset_def:
return MemoryDataset([], name=task_name)

fmt = dataset_def.get("format", "memory")
extra_args: dict = dataset_def.get("args") or {}

if fmt == "json":
source = dataset_def.get("source")
if not source:
raise ValueError(
f"Task '{task_name}': dataset format 'json' requires a 'source' field."
)
return json_dataset(source, **extra_args)

if fmt == "csv":
source = dataset_def.get("source")
if not source:
raise ValueError(
f"Task '{task_name}': dataset format 'csv' requires a 'source' field."
)
return csv_dataset(source, **extra_args)

if fmt == "memory":
raw_samples = dataset_def.get("samples", [])
samples = []
for raw in raw_samples:
sample = Sample(
input=raw["input"],
target=raw.get("target", ""),
id=raw.get("id"),
metadata=raw.get("metadata"),
files=raw.get("files"),
setup=raw.get("setup"),
sandbox=raw.get("sandbox"),
)
samples.append(sample)

return MemoryDataset(
samples,
name=dataset_def.get("name", task_name),
)

raise ValueError(
f"Task '{task_name}': unknown dataset format '{fmt}'. "
f"Expected one of: 'memory', 'json', 'csv'."
)


def run_from_json(manifest_path: str | Path) -> bool: