Merged
1 change: 1 addition & 0 deletions .github/workflows/dash_evals_module_tests.yml
@@ -35,6 +35,7 @@ jobs:
run: |
source .venv/bin/activate
pip install --upgrade pip
pip install -e ../dataset_config_python
pip install -e ".[dev]"

- name: Run tests
1 change: 1 addition & 0 deletions .github/workflows/docs.yml
@@ -32,6 +32,7 @@ jobs:
run: |
pip install --upgrade pip
pip install -r docs/requirements.txt
pip install -e packages/dataset_config_python
pip install -e packages/dash_evals

- name: Install Dart dependencies
1 change: 1 addition & 0 deletions README.md
@@ -11,6 +11,7 @@ This repo includes

- **eval runner** — Python package for running LLM evaluations with configurable tasks, variants, and models
- **config packages** — Dart and Python packages that resolve dataset YAML into EvalSet JSON for the runner
- **NB**: These packages largely overlap and coexist for backwards compatibility; in time, the Dart package will be deprecated.
- **devals CLI** — Dart CLI for creating and managing dataset samples, tasks, and jobs
- **Evaluation Explorer** — Dart/Flutter app for browsing and analyzing results

7 changes: 4 additions & 3 deletions docs/contributing/packages/dash_evals.md
@@ -41,9 +41,10 @@ src/dash_evals/

1. **Configure**: The Dart `dataset_config_dart` package parses dataset YAML and resolves it into an `EvalSet` JSON manifest
2. **Load**: The Python runner reads the JSON manifest via `json_runner.py`, resolving task functions dynamically with `importlib`
3. **Execute**: Each task function receives its dataset and task definition, producing an `inspect_ai.Task`
4. **Score**: Scorers evaluate model outputs against targets
5. **Log**: Results written to the configured `log_dir`
3. **Hydrate**: Config dicts are converted to Inspect AI objects (datasets, MCP servers, skills) using shared helpers from `dataset_config_python.hydrate`
4. **Execute**: Each task function receives its dataset and task definition, producing an `inspect_ai.Task`
5. **Score**: Scorers evaluate model outputs against targets
6. **Log**: Results written to the configured `log_dir`

Alternatively, the runner can be invoked directly with `--task` and `--model` arguments (via `args_runner.py`), bypassing the Dart config pipeline.

2 changes: 1 addition & 1 deletion docs/contributing/repository_structure.md
@@ -10,7 +10,7 @@ evals/
│ ├── devals_cli/ # Dart CLI for managing dataset (devals)
│ ├── dataset_config_dart/ # Dart library: YAML → EvalSet JSON
│ ├── dash_evals/ # Python evaluation runner
│ ├── dataset_config_python/ # Python configuration models
│ ├── dataset_config_python/ # Python config: YAML → EvalSet JSON + config → Inspect AI objects
│ └── eval_explorer/ # Dart/Flutter results viewer (Serverpod)
├── tool/ # Utility scripts
├── pubspec.yaml # Dart workspace configuration
24 changes: 14 additions & 10 deletions docs/guides/about_the_framework.md
@@ -18,6 +18,7 @@ YAML config → Dart resolver → JSON manifest → Python runner → Inspect AI
|-------|---------|-------------|
| **YAML config** | — | Your `task.yaml` and `job.yaml` files |
| **Dart resolver** | `dataset_config_dart` | Parses YAML, resolves globs and references, produces a JSON manifest |
| **Hydration** | `dataset_config_python` | Converts config dicts into Inspect AI objects (datasets, MCP servers, skills) |
| **Python runner** | `dash_evals` | Reads the manifest, builds Inspect AI `Task` objects, calls `eval_set()` |
| **Inspect AI** | `inspect_ai` | Runs solver chains, sends prompts, collects responses, runs scorers |
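To make the hand-off between stages concrete, here is a sketch of what a task entry in the JSON manifest might look like. The exact schema is an assumption inferred from the dataset fields used elsewhere in this repo (`format`, `source`, `args`, inline `samples`), not an authoritative spec.

```python
import json

# Illustrative only: a minimal manifest task entry with inline samples
# (format "memory", the default).
memory_entry = {
    "name": "example_task",
    "dataset": {
        "format": "memory",
        "samples": [
            {"input": "What is 2 + 2?", "target": "4", "id": "s1"},
        ],
    },
}

# A json/csv dataset instead points at a file or URL via "source",
# with extra keyword arguments carried in "args".
csv_entry = {
    "name": "csv_task",
    "dataset": {"format": "csv", "source": "data/samples.csv", "args": {"shuffle": True}},
}

# The Dart resolver hands the Python runner a JSON document, so the
# entries must survive a round-trip through JSON serialization.
manifest = json.loads(json.dumps({"tasks": [memory_entry, csv_entry]}))
print(manifest["tasks"][1]["dataset"]["format"])  # → csv
```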

@@ -148,16 +149,19 @@ calling `submit()`.

## Shared helpers

The `task_helpers.py` module contains functions used across all tasks:

| Helper | What it does |
|--------|-------------|
| `append_context_injection(chain, config)` | Adds a `context_injector` solver if the variant has `files` |
| `append_model_interaction(chain, config)` | Adds `react()` (if tools exist) or `generate()` (if not) |
| `get_skill_tool(config)` | Creates a skill tool if the variant has `skills` configured |
| `build_task_metadata(config)` | Builds the metadata dict for the `Task` object |
| `create_mcp_servers(configs, sandbox_type)` | Creates MCP server objects from variant config |
| `validate_sandbox_tools(config, tool_names)` | Checks that sandbox-requiring tools aren't used on local |
The `task_helpers.py` module contains functions used across all tasks. Some of
these are re-exported from `dataset_config_python.hydrate`, the shared
config-interpretation layer that both `dash_evals` and external consumers (such
as yardstick) use, ensuring config is hydrated into Inspect AI objects
consistently.

| Helper | Source | What it does |
|--------|--------|-------------|
| `create_mcp_servers(configs, sandbox_type)` | `dataset_config_python` | Creates MCP server objects from variant config |
| `get_skill_tool(config)` | `dataset_config_python` | Creates a skill tool if the variant has `skills` configured |
| `build_task_metadata(config)` | `dataset_config_python` | Builds the metadata dict for the `Task` object |
| `append_context_injection(chain, config)` | `dash_evals` | Adds a `context_injector` solver if the variant has `files` |
| `append_model_interaction(chain, config)` | `dash_evals` | Adds `react()` (if tools exist) or `generate()` (if not) |
| `validate_sandbox_tools(config, tool_names)` | `dash_evals` | Checks that sandbox-requiring tools aren't used on local |
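The append-style helpers above can be pictured with a toy sketch. This is illustrative only: the names mirror the table, but the bodies are stand-ins, and the real helpers build chains of Inspect AI solvers rather than strings.

```python
# Toy model of the chain-building pattern: each helper inspects the
# variant config and conditionally appends a "solver" to the chain.

def append_context_injection(chain: list, config: dict) -> None:
    # Real helper adds a context_injector solver when the variant has files.
    if config.get("files"):
        chain.append("context_injector")

def append_model_interaction(chain: list, config: dict) -> None:
    # Real helper adds react() when tools exist, otherwise generate().
    chain.append("react" if config.get("tools") else "generate")

chain: list = []
config = {"files": ["notes.md"], "tools": ["mcp_server"]}
append_context_injection(chain, config)
append_model_interaction(chain, config)
print(chain)  # → ['context_injector', 'react']
```

A variant with no files and no tools would produce just `['generate']`, which matches the fallback described in the table.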

These helpers mean that most of the variant logic (context injection, MCP tools,
skills) is handled **automatically**. You just need to define the core solver
22 changes: 19 additions & 3 deletions docs/reference/dart_api/dataset_config_dart/dataset_config_dart.md
@@ -336,7 +336,7 @@ and [`MemoryDataset`](https://inspect.aisi.org.uk/reference/inspect_ai.dataset.h
#### `Dataset`

```dart
Dataset({List<Sample> samples, String? name, String? location, bool shuffled})
Dataset({List<Sample> samples, String? name, String? location, bool shuffled, String format, String? source, Map<String, dynamic>? args})
```

#### `Dataset.fromJson`
@@ -1007,7 +1007,7 @@ inspect_eval_arguments:
#### `Job`

```dart
Job({String? description, required String logDir, int maxConnections, List<String>? models, Map<String, Map<String, dynamic>>? variants, List<String>? taskPaths, Map<String, JobTask>? tasks, bool saveExamples, Map<String, dynamic>? sandbox, Map<String, dynamic>? inspectEvalArguments, TagFilter? taskFilters, TagFilter? sampleFilters})
Job({String? description, required String logDir, int maxConnections, required List<String> models, Map<String, Map<String, dynamic>>? variants, List<String>? taskPaths, Map<String, JobTask>? tasks, bool saveExamples, Map<String, dynamic>? sandbox, Map<String, dynamic>? inspectEvalArguments, TagFilter? taskFilters, TagFilter? sampleFilters})
```

#### `Job.fromJson`
@@ -1203,7 +1203,7 @@ former `TaskConfig` model-package class.
#### `ParsedTask`

```dart
ParsedTask({required String id, required String func, required List<Sample> samples, required Variant variant, String sandboxType, String? systemMessage, bool saveExamples, String? examplesDir, Map<String, dynamic>? sandboxParameters, Map<String, String>? taskFiles, String? taskSetup, String? model, Map<String, dynamic>? config, Map<String, String>? modelRoles, Object? sandbox, Object? approval, Object? epochs, Object? failOnError, bool? continueOnFail, int? messageLimit, int? tokenLimit, int? timeLimit, int? workingLimit, double? costLimit, Object? earlyStopping, String? displayName, Object? version, Map<String, dynamic>? metadata})
ParsedTask({required String id, required String func, required List<Sample> samples, required Variant variant, String sandboxType, String? systemMessage, bool saveExamples, String? examplesDir, Map<String, dynamic>? sandboxParameters, Map<String, String>? taskFiles, String? taskSetup, String? model, Map<String, dynamic>? config, Map<String, String>? modelRoles, Object? sandbox, Object? approval, Object? epochs, Object? failOnError, bool? continueOnFail, int? messageLimit, int? tokenLimit, int? timeLimit, int? workingLimit, double? costLimit, Object? earlyStopping, String? displayName, Object? version, Map<String, dynamic>? metadata, String datasetFormat, String? datasetSource, Map<String, dynamic>? datasetArgs})
```

### Properties
@@ -1304,6 +1304,18 @@ ParsedTask({required String id, required String func, required List<Sample> samp

Additional metadata to associate with the task.

- **`datasetFormat`** → `String` *(final)*

Dataset format: 'memory' (inline samples), 'json', or 'csv'.

- **`datasetSource`** → `String?` *(final)*

File path or URL for json/csv datasets.

- **`datasetArgs`** → `Map<String, dynamic>?` *(final)*

Extra kwargs passed to `json_dataset()` or `csv_dataset()`.
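A hypothetical `task.yaml` dataset stanza exercising these fields might look like the following. The YAML key names (`format`, `source`, `args`) are assumptions mapped from the properties above, not a verified schema.

```yaml
# Inline samples (datasetFormat 'memory', the default):
dataset:
  samples:
    - input: "What is 2 + 2?"
      target: "4"

# External CSV (datasetFormat 'csv'), with extra kwargs forwarded
# to csv_dataset() via datasetArgs:
# dataset:
#   format: csv
#   source: data/samples.csv
#   args:
#     shuffle: true
```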

### Methods

#### `copyWith`
@@ -1722,6 +1734,10 @@ Job createDefaultJob(String baseDir)

Create a [Job] with default settings (when no job file is provided).

Note: the caller must specify models, as there are no defaults. This function
creates a job with an empty models list; the resolver raises an error if the
list is still empty at resolution time.

**Parameters:**

- `baseDir` (`String`) *(required)*
1 change: 1 addition & 0 deletions packages/dash_evals/pyproject.toml
@@ -15,6 +15,7 @@ dependencies = [
"openai>=2.8.1,<3.0.0",
"firebase-admin>=6.0.0,<8.0.0",
"pydantic>=2.0.0,<3.0.0",
"dataset-config-python",
]

[project.optional-dependencies]
68 changes: 1 addition & 67 deletions packages/dash_evals/src/dash_evals/runner/json_runner.py
@@ -11,7 +11,7 @@
from pathlib import Path

import inspect_ai
from inspect_ai.dataset import MemoryDataset, Sample, csv_dataset, json_dataset
from dataset_config_python.hydrate import build_dataset as _build_dataset

from dash_evals.utils.logging import capture_output, setup_logging

@@ -94,74 +94,8 @@ def _resolve_task_func(name: str):
return func


def _build_dataset(task_def: dict):
"""Build an Inspect AI dataset from a task definition.

Dispatches on ``task_def["dataset"]["format"]``:

- ``"memory"`` (default): builds a ``MemoryDataset`` from inline samples.
- ``"json"``: delegates to ``inspect_ai.dataset.json_dataset(source, **args)``.
- ``"csv"``: delegates to ``inspect_ai.dataset.csv_dataset(source, **args)``.

Args:
task_def: A task entry from the EvalSet JSON manifest.

Returns:
An Inspect AI dataset object.

Raises:
ValueError: If the dataset format is unrecognized or required fields
(e.g. ``source`` for json/csv) are missing.
"""
dataset_def = task_def.get("dataset")
task_name = task_def.get("name", "")

if not dataset_def:
return MemoryDataset([], name=task_name)

fmt = dataset_def.get("format", "memory")
extra_args: dict = dataset_def.get("args") or {}

if fmt == "json":
source = dataset_def.get("source")
if not source:
raise ValueError(
f"Task '{task_name}': dataset format 'json' requires a 'source' field."
)
return json_dataset(source, **extra_args)

if fmt == "csv":
source = dataset_def.get("source")
if not source:
raise ValueError(
f"Task '{task_name}': dataset format 'csv' requires a 'source' field."
)
return csv_dataset(source, **extra_args)

if fmt == "memory":
raw_samples = dataset_def.get("samples", [])
samples = []
for raw in raw_samples:
sample = Sample(
input=raw["input"],
target=raw.get("target", ""),
id=raw.get("id"),
metadata=raw.get("metadata"),
files=raw.get("files"),
setup=raw.get("setup"),
sandbox=raw.get("sandbox"),
)
samples.append(sample)

return MemoryDataset(
samples,
name=dataset_def.get("name", task_name),
)

raise ValueError(
f"Task '{task_name}': unknown dataset format '{fmt}'. "
f"Expected one of: 'memory', 'json', 'csv'."
)


def run_from_json(manifest_path: str | Path) -> bool: